Tangoing with nonsymmetric contraction matrices: Understanding Gelfand’s formula and its implications on linear convergence rates

I have finally reached the point in my life when I have the misfortune of meeting not one, but two nonsymmetric “contraction” matrices, and have had to try to understand why it is that, even after all the hard work of proving that a matrix’s largest eigenvalue is $\lambda_{\max}\leq \rho <1$, it does not mean that we have a contraction rate of $\rho^k$. This journey includes not just a lot of bruises from banging one’s head against a wall, but also a (reunion) encounter with degenerate eigenvalues, algebraic vs geometric multiplicity, and the ever-present linear algebra embodiment of “speak softly and carry a big stick”, the Jordan canonical form.

(Apologies for rambling language. I haven’t had much sleep.)

Ok, so here is the problem. Let’s say I have a method, and through thick and thin I have finally proven that this method can be bounded in terms of something like

x^{(k)}-x^* = A_k (x^{(k-1)}-x^*) = (\prod_{i\leq k} A_i)(x^{(0)}-x^*).\tag{Method}

Then, someone way smarter than me has also shown that the largest eigenvalue of $A_k$, in modulus, is also $\rho <1$, for all $k$. Great! we say. We can conclude that $\|x^{(k)}-x^*\|_*\leq O(\rho^k)$! And to add insult to injury, all our simulations actually show this rate! So we’re done right?

Wait. No. $A_k$ is not symmetric. So, while it is true that

\|x^{(k)}-x^*\|_2 = \| (\prod_{i\leq k} A_i)(x^{(0)}-x^*)\|_2\leq \|x^{(0)}-x^*\|_2,\tag{LinRate}

we didn’t actually show that $\max_{i\leq k} \|A_i\|_2^k\leq \rho < 1$ (or that any $\|A_i\|_2\leq \rho$).  So, now what do we do?

First approach: Gelfand’s formula.

Through some googling, we usually can find this formula:

|\lambda_{\max}| = \lim_{k\to\infty}\|A^k\|_2^{1/k}

or, more specifically,

|\lambda_{\max}| \geq  \lim_{k\to\infty}\|A^k\|_2^{1/k} – \epsilon_k, \qquad \epsilon_k\to 0.

So, we can always just stop there and say that the rate (LinRate) is asymptotically true.  This is somewhat unsatisfying, first because asymptotic rates are a bit wussy; but secondly, because we just kind of used some old guy’s rule without really understanding why it may be true.

Second approach: Complex diagonalization

It turns out that the issue we were facing above was not that $A_k$ is not symmetric, but that it is not necessarily diagonalizable. But, in cases where $A_k$ are diagonalizable, we actually don’t have that much to worry about.

Take, for example, the contraction matrix associated with the Polyak Heavy Ball method:

A_k = \begin{bmatrix} (1+\beta) I -\eta H_k & -\beta I \\ I & 0 \end{bmatrix}

and Polyak’s method can indeed be written like (Method) where $H_k = \nabla^2 f(x^{(k)})$. When the problem is quadratic, then $H_k$ is constant, but otherwise it may change with respect to $k$; but, under the right smoothness and strong convexity conditions, its eigenvalues will be bounded. Overall, with some smarts, we can show that $\lambda_{\max}(A_k) \leq \sqrt{\beta}$, and we really would like to say that this means $\|x^{(k)}-x^*\|_2 = O(\sqrt{\beta}^k)$. (This part of the analysis is well-known, was first shown by Polyak in [1], and actually I had discussed it in an earlier blog post.)

Now, suppose that $H_k$ is not constant, but satisfies the spectral bounds in our assumptions. Now, we are following the analysis in [2], which is cute, simple, and pretty powerful. Basically, we first observe that $H_k$ is symmetric and diagonalizable, and thus we may write $A_k$ :

H_k = U_k\Sigma_kU_k^T, \qquad A_k = \begin{bmatrix} U_k & 0 \\ 0 & U_k \end{bmatrix}\underbrace{ \begin{bmatrix}(1+\beta)I-\eta \Sigma_k & -\beta I \\ I & 0 \end{bmatrix} }_{T_k}\begin{bmatrix} U^T_k & 0 \\ 0 & U^T_k \end{bmatrix}

Furthermore, note that a permutation exists such that $\Pi T_k\Pi^T$ is block diagonal, with each block as the 2×2 matrix

(\Pi T_k\Pi^T )_{2i-1:2i}= \begin{bmatrix}(1+\beta)-\eta (\Sigma_k)_i & -\beta 1 \\ 1 & 0 \end{bmatrix}

and importantly, this matrix is diagonalizable! So, using the right combination of permutation matrices, we will be able to diagonalize $T_k = Q_kZ_kQ_k^{-1}$ where $Z_k$ is diagonal and $\sqrt{\beta} \geq \max_k |(Z_k)_{jj}|$. Note also that both $Q$ and $Z$ ought to have complex values (but that in itself is not problematic.)

Then, the rest is not too tricky. In the paper [2], they define $P_k = \begin{bmatrix} U_k & 0 \\ 0 & U_k \end{bmatrix}Q_k$ so that $A_k = P_kZ_kP_k^T$, and introduce a dummy variable $u_k = P_k^{-1}(x^{(k)}-x^*)$ so that

u_k = P_k^{-1}A_k P_k u_{k-1}= Z_k u_{k-1} = \prod_{i\leq k}Z_iu_0
and since $Z_i$ is diagonal with largest modulus value $\leq \sqrt{\beta}$, it is not hard to say $\|u_k\|_2\leq \sqrt{\beta}^k u_0$. In the paper they actually use an additional step to say $\|x_k-x^*\|_2\leq \kappa(P)\|u_k\|_2$ which adds a constant factor, but actually $P_k$ looks like it ought to be well conditioned in practice.

But, note that the hard part of this proof was to show that $T_k$ is diagonalizable (and therefore $A_k$ is diagonalizable) despite not being symmetric. Now, we have to ask ourselves, what happens if $A_k$ is not diagonalizable?

Third approach: try to remember some useful linear algebra facts.

Sometimes in the midst of some serious REM sleep, I’ll have reoccuring dreams of old math lectures. In one of them is a friendly graduate school professor chanting, in a comforting mantra: “not all matrices are diagonalizable… ” I then try to focus the memory a bit, to skip ahead a bit to the next part of the lecture, which says “…but the Jordan canonical form (JCF) always exists!”

To remember what this JCF thing is, let’s take as an example

A = \begin{bmatrix} 1 & \alpha \\ 0  & 1 \end{bmatrix}.

If we then look at its characteristic polynomial, then $det(A-\lambda I) = 0$ only when $\lambda = 1$, so it has two identical roots. But, we can also directly compute that
A^k = \begin{bmatrix} 1 & k\alpha \\ 0  & 1 \end{bmatrix}.

so clearly, its spectral radius is not bounded in the limit $k\to \infty$! Therefore, in this case, it would be absolutely incorrect to say (LinRate), for $\rho = 1$. However, if we just scale down the diagonal of $A$ ever so slightly, then
A = \begin{bmatrix} \beta & k\alpha \\ 0  & \beta \end{bmatrix}, \qquad
A^k = \beta^k \begin{bmatrix} 1 & \frac{\alpha}{\beta} \\ 0  & 1 \end{bmatrix}^k
= \beta^k \begin{bmatrix} 1 & \frac{k\alpha}{\beta} \\ 0  & 1 \end{bmatrix}

and since $\beta^k$ shrinks faster than $k\alpha/\beta$, this quantity will go to 0, asymptotically as $\rho(\beta^k)$, but not necessarily upper bounded by $\beta^k$.

Connection wiht Jordan canonical form

The JCF decomposition of a matrix $A$ is

A = UJU^{-1}

where $\|U\|_2 = 1$ and $J = \mathrm{diag}(J_1,J_2,…,J_D)$ contains diagonal blocks

J_i = \begin{bmatrix} \lambda_i & 1 & 0 & … & 0 & 0 \\ 0 & \lambda_i & 1 & … &0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots
\\ 0 & 0 & 0 & …  & \lambda_i&1
\\ 0 & 0 & 0 & … & 0 & \lambda_i\end{bmatrix}

The blocksize of $J_i$  is the geometric  multiplicity of the corresponding $i$th eigenvalue. This could be any number; 1 means it is the eigenvalue you already know and love, but > 1 means that there are “hidden eigenvector spaces”, where in fact the matrix spans more spaces but we can’t really reach them using an eigenbasis. That is the case of the scary matrix we had found. Note that, taking any one of these diagonal blocks, we can get
=  \lambda_i^k\begin{bmatrix} 1& \binom{k}{1}\lambda_i^{-1} &  \binom{k}{2}\lambda_i^{-2} & … & \binom{k}{b_i}\lambda_i^{-b_i} & 0 \\ 0 & 1& \binom{k}{2}\lambda_i^{-2} & … & \binom{k}{b_i-2}\lambda_i^{-b_i-2} & \binom{k}{b_i-1}\lambda_i^{-b_i-1}\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots
\\ 0 & 0 & 0 & …  & 1& \binom{k}{1}\lambda_i^{-1}
\\ 0 & 0 & 0 & … & 0 &1\end{bmatrix}
More precisely, I can write $\binom{k}{b} \leq \frac{k^b}{b!}$, which guarantees that, for $\lambda_i\leq 1$,
\|J_i^k\|_\infty \leq \lambda_i^k \cdot  \frac{ k^{b_i}}{b_i!}
and therefore
\|J_i^k\|_2 = \max_{u\neq 0} \frac{\|J_i^k u\|_2}{\|u\|_2} \leq \lambda_i^k \cdot  \frac{ k^{b_i}}{b_i!} \cdot \underbrace{\max_{u\neq 0} \frac{|u\|_1 \|\mathbf{1}\|_2}{\|u\|_2}}_{b_i\sqrt{b_i}}   =  \lambda_i^k \cdot  \frac{\sqrt{b_i}  k^{b_i }}{(b_i-1)!}

which overall gives an overall nonasymptotic convergence rate of
\|x^{(k)}-x^*\|_2\leq \rho^k \max_i     \frac{\sqrt{b_i}  k^{b_i }}{(b_i-1)!}
where $b_i$ are the Jordan blocks of $A$!  (Nope, no constants!) Indeed  this is $\|J_i^k\|\lesssim \lambda_i^k$ (but not nonasymptotically).

Conclusions and future thoughts.

So at this point, we’ve concluded that indeed, it is not a good idea to bound the action of “contraction” matrices (in quotes because they may not necessarily contract) based on the behavior of the largest eigenvalue; however, we can also see qualitatively how eventually, the spectral radius does give a linear rate. Note that this analysis can be done by using the ever-forgotten always-powerful JCF, which very explicitly points out when eigenvalues are problematic. (Not from symmetry, but from degeneracy.)

Overall (edited),  we can actually come up with better nonasymptotic rates for nondiagonalizable contraction matrices, using the JCF. It feels a bit unsettling because it does not contain a “fully linear” overall trend, and depends critically on $b_i$, the order of the Jordan blocks, which may be hard to ascertain in practice. But, it is a nonasymptotic rate!

Well, hope y’all are enjoying your day!



[1]: Polyak, Boris T. “Introduction to optimization. optimization software.” Inc., Publications Division, New York 1 (1987): 32.

[2] Wang, Jun-Kun, Chi-Heng Lin, and Jacob D. Abernethy. “A modular analysis of provable acceleration via polyak’s momentum: Training a wide RELU network and a deep linear network.” In International Conference on Machine Learning, pp. 10816-10827. PMLR, 2021.


Many thanks to Sharan, Reza, and Baojian, whose discussions helped me understand this issue.



High dimensional mean value theorem.

Here’s a new “overthinking it” issue for you. For a while now, I’ve been relying on the high-dimensional version of a Taylor series approximation result to help me flip variables and gradients. That is, for 1-D functions with continuous Hessians everywhere,  the mean value theorem says that there always exists a $z$ where $x\leq z \leq y$, such that

g”(z) (x-y)=g'(x)-g'(y).

Without blinking, we use this to extend to $f:\mathbb R^n\to \mathbb R$, and say that there always exists some $z$ where

\nabla^2 f(z) (x-y) = \nabla f(x) – \nabla f(y).\tag{MVT-hd}

Never mind that the notion of “between $x$ and $y$” doesn’t really make sense. We just love the result, and use it to flip primal and dual solutions, based on the qualities of $\nabla^2 f(z)$, willy nilly.

But, one thing I am realizing as I go through my convex analysis adventure, is that everything that is true in high dimensions can be proven using only tools in low dimensions.  And by low dimensions, I mean 1-dimensional. So, I should be able to, using only 1-D function knowledge, prove (MVT-hd).

First, (MVT-hd) is true if

u_k^T\nabla^2 f(z) (x-y) =u_k^T( \nabla f(x) – \nabla f(y)).\tag{MVT-hd}

for some set of $u_1,u_2,…,u_n$ spanning $\mathbb R^n$. To make life easy, let’s pick the standard basis. Then,

e_k^T\nabla^2 f(z)(x-y) = \sum_{i=1}^n \frac{\partial f(u)^2}{\partial u_k\partial u_k}\Bigg|_{u=z} \cdot (x_i-y_i)


e_k^T(\nabla f(x)-\nabla f(y)) = \frac{\partial f(u)}{\partial u_k}\Bigg|_{u=x}- \frac{\partial f(u)}{\partial u_k}\Bigg|_{u=y}

Let us now define some dummy 1-D functions as
g_k(\alpha) = \frac{\partial f(u)}{\partial u_k}\Bigg|_{u=x+\alpha(y-x)}


e_k^T(\nabla f(x)-\nabla f(y))=g_k(1)-g_k(0)


g’_k(\alpha) = \sum_{j=1}^n \frac{\partial^2 f(u)}{\partial u_k\partial u_j}\Bigg|_{u=x+\alpha(y-x)}.\cdot (y_j-x_j)

Now this is very promising! because now we can invoke MVT on $g_k(\alpha)$ and say that there must exist some $0\leq \bar \alpha \leq 1$ where

g_k(1)-g_k(0)=g’_k(\bar\alpha)\iff e_k^T(\nabla f(x)-\nabla f(y)) = \sum_{j=1}^n \frac{\partial^2 f(u)}{\partial u_k\partial u_j}\Bigg|_{u=x+\bar \alpha(y-x)}\cdot (y_j-x_j).

However, there’s one catch: there’s no reason to assume that this $\bar\alpha$ is the same point for each $k$. So, while (MVT-hd) has one $z$ for its relation, here, we’ve got $n$ different $\bar \alpha_k$s, corresponding to each $g_k$.

Of course, you could always create a new function $g(\alpha) = \sum_k u_k g_k(\alpha)$, and following the same analysis, you could show that, for any $u$, there exists some $z$ where

u^T\nabla^2 f(z) (x-y) =u^T( \nabla f(x) – \nabla f(y)).\tag{MVT-hd-proj}

It turns out that this alone is enough for me to prove most of what I need. However, right now I am hesitant  to say that one unique $z$ exists for all $u$.

I hope to be proven wrong, but in the meantime, I shall begin my search for a counterexample!

We need to talk about Adam

At some point in every optimist’s life, we are going to be asked the age-old question: “How does your method compare with Adam?” or, even more annoyingly, “Does this analysis help understand Adam?” I mean, Adam is a nice method that works very well for a very unique set of optimization problems, and not, like, the answer to all the problems in the universe (42). That being said, even if Adam (the method) isn’t exactly Adam (the first man), we should probably know a little bit about what it’s about.

In case you were living under a rock for the past 10 years, Adam refers to ADAptive Moment estimation, this magical method that came out in 2015 by Kingma and Ba, and is widely hailed as the one training method that actually seems to work well. It was the latest in adaptive step size methods, starting with AdaGrad (Duchi, Hazan, Singer 2011), but including AdaDelta (Zeiler 2012), RMSProp (Tieleman & Hinton, 2012), etc.

Adam (Kingma and Ba, 2015)

(typo above: should be $\alpha_t = \alpha/\sqrt{t}$, not constant $\alpha$)

Not Adam Or Adam

Anyway, so I finally decided it was time to bite the bullet, and learn about what this Adam thing is all about, follow the journey to a proper convergence proof, and give some comments as to how it may work in practice.  This post will largely be a literature survey about two parts of this story: the original paper, and the followup proof

[1] Kingma and Ba. Adam: A Method for Stochastic Optimization

[2] Reddi, Kale, Kumar. On the Convergence of Adam and Beyond.

Related prior works

As previously mentioned, Adam comes as a member of the “adaptive step size committee”, a string of methods that followed AdaGrad, the often hailed seminal work of Duchi et al who first proposed adaptively adjusting the step size to help deep network training deal with the vanishing gradient problem. Followup work, like RMSProp, and Adam itself, can be seen as “descendants” of this work.  While that paper in itself holds an important place in this journey, in the interest of time I will not mention it much more here.

What the papers do vs what people actually do

Before going on, I need to point out, indeed, highlight, three assumptions (1-3) that are critical in the proofs but clearly not used in practice. Additionally, we will mention a bug (4), which was later fixed in [2], and just for completeness a minor thing to remember (5).

  1. The proof assumes $f_t$, the online loss function, is convex in its parameters.
  2. The proof uses a decaying step size $\alpha_t = \alpha/\sqrt{t}$. In practice, we use a constant $\alpha$.
  3. The proof uses a decaying mixing parameter $\beta_{1,t} = \beta_{1,0}\lambda^{t-1}$ for some $0 < \lambda  < 1$.
  4. (Bug, fixed in later work) We must require $\sqrt{v_t}/\alpha_t$ is a decreasing sequence.
  5. (Minor) the proof assumes $\epsilon = 0$. (In practice, $\epsilon \approx 0$.)

Now, I should do a disclaimer and say that everyone and their sister, including me, often resorts to unreasonable assumptions the night before the deadline. It’s unsatisfying when theory doesn’t quite match practice, but the truth is doing convergence proofs is hard, especially for very general assumptions. However, what makes me feel that this is a bit less ok than the usual case is that Adam is often hailed as the method to use ([1] is cited over 100,000 times). Therefore, the fact that we as a community don’t seem as interested in closing this loop is a bit baffling to me. For this reason, at the end of this post, I will try to highlight a few papers that tried to “fix the proof” by justifying these assumptions. If I have missed out a work in this category that you know of, please do let me know and I will update it.

(By the way, I want to clarify, this is not a rant against the authors of [1]. I am not trying to take away the significance of their contribution. It’s more of a question of whether, as a community, we are satisfied with this story being “done”.)

Proof of Adam ([1], with help from [2])

For $g_t = \nabla_\theta(f_t(\theta_{t-1}))$, consider the method

m_t &=& \beta_{1,t}m_{t-1} + (1-\beta_{1,t})g_t, \quad \hat m_t = \frac{m_t}{1-\beta_1^t}\\
v_t &=& \beta_{2}v_{t-1} + (1-\beta_{2})g^2_t, \quad \hat v_t = \frac{v_t}{1-\beta_2^t}\\
\theta_t &=& \theta_{t-1}\, -\, \frac{\alpha_t }{\sqrt{\hat v_t}}\hat m_t

where we initialize $m_0 = v_0 = 1$ and $\beta_{1,0} = \beta_1$. Here, $m_t,v_t,\theta_t,g_t$ are all vectors in $\mathbb R^d$ and with an extreme abuse of notation, all multiplications and divisions between these vectors are taken element-wise. The scalars are $\alpha_t,\beta_{1,t},\beta_2$, and of course $t$.

Theorem: Assume that

  • the functions $f_t(\theta) = \mathcal L_\theta(x_t,y_t)$ are convex in $\theta\in \mathbb R^d$
  • the gradients of $f_t$ are bounded everywhere $\|\nabla f_t(\theta)\|_2 \leq G$, $\|f_t(\theta)\|_\infty \leq G_\infty$
  • the distance between any two $\theta_t$ generated by Adam is bounded $$\|\theta_n-\theta_m\|_2\leq D, \quad \|\theta_m-\theta_n\|_\infty \leq D_\infty, \text{  for any   } m,n \in \{1,…,T\}$$

Pick parameters such that $\beta_1,\beta_2 \in [0,1)$, $\beta_1^2/\sqrt{\beta_2} < 1$, $\alpha_t = \alpha/\sqrt{t}$, $\lambda\in (0,1)$, and define $\beta_{1,t}=\beta_1\lambda^{t-1}$. Then for all $T\geq 1$, the regret $$R(T) = \sum_{t=1}^Tf_t(\theta_t)-\min_\theta \sum_{t=1}^T f_t(\theta)$$ is upper bounded
\frac{D^2}{2\alpha}\frac{ \sqrt{G}}{\sqrt{1-\beta_2}}
+ \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} \frac{\alpha\sqrt{\log(T)+1} }{2(1-\beta_1\lambda)} \|g_{1:T}\|_1 \\
+ \frac{1}{1-\beta_1} \frac{dG_\infty\sqrt{ \beta_2}}{\sqrt{ \beta_2}-\beta_1^2} + \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}(1-\lambda)^2}
In terms of big-O, the first, third, and fourth term are constants, and overall we get an $R(T) = O(\sqrt{\log(T)}\|g_{1:T}\|_1$ term. Here I should note that in the two papers [1] and [2], actually this is a $\|g_{1:T}\|_2$  term, but in recreating the proof I was not able to recover that. That being said, these norms are exchangable, so we’ll just leave it as is. The  $\|g_{1:T}\|_2$ also appears in AdaGrad, and the argument is that if your gradients are sparse enough, this guy is also $O(\sqrt{T})$. Seems like a rough assumption, but let’s go with the flow for now.

Additionally, it should be noted that there does not appear to be a significant speedup over SGD. In [1], the argument is that Adam is $O(\log(d)/\sqrt{t})$, (which got boosted in terms of $t$ by [2]), but in no way can I recreate the proof so that the dependency on $d$ is less than linear. (I’d be happy to correct it if people feel differently.)

Either way, the fact that Adam does not claim to have a faster convergence rate than SGD in terms of $t$, is somewhat significant.


Now let’s do the proofs

Using the update rule $\theta_{t+1}=\theta_t-\frac{\alpha_t\hat m_t}{\sqrt{\hat v_t}}$ where $\epsilon = 0$
then we may write

(\theta_{t+1,i}-\theta_i^*)^2&=&(\theta_{t,i}-\theta_i^*)^2 – 2\alpha_t\frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}}}(\theta_{t,i}-\theta_i^*) + \alpha^2_t\frac{\hat m^2_{t,i}}{\hat v_{t,i}}\\
&=&(\theta_{t,i}-\theta_i^*)^2 – \frac{2(1-\beta_{1,t})\alpha_tg_{t,i}}{(1-\beta^t_1)\sqrt{\hat v_{t,i}}}(\theta_{t,i}-\theta_i^*)
– \frac{ 2\alpha_t\beta_{1,t} m_{t-1,i} }{\sqrt{\hat v_{t,i}}(1-\beta^t_1)}(\theta_{t,i}-\theta_i^*)
+ \alpha^2_t\frac{\hat m^2_{t,i}}{\hat v_{t,i}}.

Summing, and pulling out the term $g_t^T(\theta_t-\theta^*)$

g_t^T(\theta_t-\theta^*) &=& \sum_{i=1}^d\left(\left((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta_i^*)^2\right)\frac{\sqrt{\hat v_{t,i}}}{2\alpha_t }\frac{1-\beta_1^t}{1-\beta_{1,t}} \,+\, \frac{\alpha_t \hat m_{t,i}^2}{2\sqrt{\hat v_{t,i}}}\frac{1-\beta_1^t}{1-\beta_{1,t}} \,-\, \frac{\beta_1 m_{t-1,i}}{(1-\beta_{1,t})}(\theta_{t,i}-\theta_i^*)\right)

The last term can be bound using
|m_{t-1}^T(\theta_t-\theta^*)|\leq \frac{\|\alpha_t^{1/2}\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\alpha_t^{-1/2}\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2}
where $\hat V_t = \mathbf{diag}(\hat v_{t})$, giving
g_t^T(\theta_t-\theta^*) &\leq & \sum_{i=1}^d\left(\left((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta_i^*)^2\right)\frac{\sqrt{\hat v_{t,i}}}{2\alpha_t }\frac{1-\beta_1^t}{1-\beta_{t,1}}\,+\, \frac{\alpha_t \hat m_{t,i}^2}{2\sqrt{\hat v_{t,i}}} \frac{1-\beta_{1}^t}{1-\beta_{t,1}}\right) \\
&&+ \frac{\beta_{t,1}}{1-\beta_{t,1}} \left(\frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}\right)


From convexity , for $\theta^* = \min_\theta \sum_{t=1}^T f_t(\theta)$ we have
$$ f_t(\theta_t)-f_t(\theta^*)\leq g_t^T(\theta_t-\theta^*), $$
R(T) &\leq &   \sum_{t=1}^T\Bigg[\frac{1}{2\alpha_t} \frac{1-\beta_1^t}{1-\beta_{t,1}} (\|\hat V^{1/4}_t(\theta_{t}-\theta^*)\|_2^2 -\|\hat V^{1/4}_t(\theta_{t+1}-\theta^*)\|_2^2  ) \,+\, \frac{\alpha_t}{2} \frac{1-\beta_1^t}{1-\beta_{t,1}}\|\hat V_t^{-1/4}\hat m_t\|_2^2  \\&&+ \frac{\beta_{t,1}}{1-\beta_{t,1}} \left(\frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}\right)\Bigg]\\
&=& R_1(T)+R_2(T)+R_3(T) + R_4(T)


R_1(T) &=& \sum_{t=1}^T\frac{1}{2\alpha_t} (\|\hat V^{1/4}_t(\theta_{t}-\theta^*)\|_2^2-\|\hat V^{1/4}_t(\theta_{t+1}-\theta^*)\|_2^2 )\\
R_2(T) &=&\sum_{t=1}^T \frac{\alpha_t}{2}\frac{1-\beta_1^t}{1-\beta_{t,1}} \|\hat V_t^{-1/4}\hat m_t\|_2^2\\
R_3(T) &=&\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}} \frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} \\
R_4(T) &=&\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}}  \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}

Simple bounds

  • First, note that, element-wise, $$|v_{t,i}| = \sum_{\tau = 1}^t \beta_2^{t-\tau} g_{\tau,i}^2\leq G_\infty^2 \sum_{\tau’ = 0}^{t-1} \beta_2^{\tau’}  \leq  \frac{G_\infty^2}{1-\beta_2}$$
    where we use a change of variables $\tau’ = t – \tau$ and upper bound with the infinite geometric series.  So, $\|v_t\|_2 \leq d\|v_t\|_\infty \leq \frac{d G_\infty^2}{1-\beta_2}$.
  • Another important inequality. Consider two sequences $a_k\geq 0$ and $b_k\geq 0$. By Holder $p = 1$,
    \sum_k a_k = \sum_k \frac{a_k}{\sqrt{b_k}}\cdot \sqrt{b_k} \leq \left(\sum_j \frac{a_j}{\sqrt{b_j}}\right)\left(\max_l \sqrt{b_l}\right)\overset{\|\cdot\|_\infty \leq \|\cdot\|_2}{\leq} \left(\sum_j \frac{a_j}{\sqrt{b_j}}\right)\left(\sqrt{\sum_l b_l}\right)
    \frac{\sum_k a_k}{\sqrt{\sum_l b_l}}  \leq \sum_j \frac{a_j}{\sqrt{b_j}} \qquad\qquad(\oplus)
  • Lemma 2 in [1], fixed in Lemma 2 in [2]. We now need to put a bound on $\sum_t \frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}}$. In [1], they try to do a proof in which this term becomes $O(\sqrt{t})$; in [2], they tightened it to $O(\sqrt{\log(T)})$. In addition to being a better rate, [2] seemed to also have fewer bugs in the proof, so I think it makes sense to just present their version.
    \frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} = \sum_{i=1}^d\frac{\sqrt{1-\beta_2^t}(1-\beta_{1,0})^2}{(1-\beta_1^t)^2(1-\beta_2)}\frac{1}{\sqrt{t}}\frac{(\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j}g_{k,i})^2}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}
    Using Cauchy Schwartz,
    \sum_{k=1}^t(1-\beta_{1,0})\prod_{j=k}^t\beta_{1,j}g_{k,i}\leq \sqrt{\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j} \sum_{l=1}^t \prod_{j=l}^t\beta_{1,j}  g^2_{l,i}}
    so we may write
    \sum_{i=1}^d\frac{(\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j}g_{k,i})^2}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}} &=& \sum_{i=1}^d\frac{\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j} \sum_{l=1}^t \prod_{j=l}^t\beta_{1,j} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
    &=& \sum_{i=1}^d\frac{\sum_{k=1}^t\beta_1^{t-k} \sum_{l=1}^t \beta_1^{t-l} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
    &\overset{\sum_{k=1}^t\beta_1^{t-k}\leq \frac{1}{1-\beta_1}}{\leq}& \frac{1}{1-\beta_1}\sum_{i=1}^d\frac{\sum_{l=1}^t \beta_1^{t-l} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
    &\overset{(\oplus)}{\leq}& \frac{1}{1-\beta_1}\sum_{i=1}^d\sum_{k=1}^t\frac{ \beta_1^{t-k} g^2_{k,i}}{\sqrt{ \beta_2^{t-k}g_{k,i}^2}}\\
    &=& \frac{1}{1-\beta_1}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1\\
    Now, let’s sum over $t$.
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{\sqrt{1-\beta_2^t}(1-\beta_{1,0})^2}{(1-\beta_1^t)^2(1-\beta_2)(1-\beta_1)}\sum_{t=1}^T\frac{1}{\sqrt{t}}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1
    The next step is to use the assumption $\beta_1^2\leq \sqrt{\beta_2}$ to get
    \frac{\sqrt{1-\beta_2^t}}{(1-\beta_1^t)^2}(1-\beta_{1,0})^2 \overset{\beta_1^2\leq \sqrt{\beta_2}}{\leq} \frac{\sqrt{1-\beta_1^{4t}}}{(1-\beta_1^t)^2}(1-\beta_{1,0})^2 \overset{\beta_{1,0}\leq 1}{\leq}\frac{\sqrt{1-\beta_1^{4t}}}{(1-\beta_1^t)^2}\leq 1
    since the 1-D curve $g(x) = \frac{1-x^4}{(1-x)^4}$ is decreasing in the interval $0\leq x \leq 1$ and $g(0) = 1$. So, more simply,
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{t=1}^T\frac{1}{\sqrt{t}}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1
    \end{eqnarray*}We now switch the $t$ and the $k$ summations, and use $\tau = t-k$. Then
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{k=1}^T\sum_{t=k}^T\frac{1}{\sqrt{t}}\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1\\
    &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{k=1}^T\sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau+k}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\|g_k\|_1\\
    &\overset{\tau+k\geq\tau}{\leq}&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\sum_{k=1}^T\|g_k\|_1
    Now we can just work with this term
    \sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\overset{C-S}{\leq}\sqrt{\sum_{\tau=0}^{T}\frac{1}{\tau}}\sqrt{\sum_{\tau’=0}^T\frac{ \beta_1^{2\tau’} }{ \beta_2^{\tau’}}}\leq \sqrt{\log(T)+1}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }}
    which gives an overall rate of
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&   \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} \sqrt{\log(T)+1}\|g_{k,1:T}\|_1 \\&=& A \sqrt{\log(T)+1}\|g_{k,1:T}\|_1
    where we absorb the constant
    $$A= \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} .$$

Now we are ready to tackle the terms one by one. First,

Term $R_1(T)$

Looking at each element,

\sum_{t=1}^T\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} ((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta^*_i)^2 )&=&
\frac{\hat v^{1/2}_{1,i}}{2\alpha_1} (\theta_{1,i}-\theta_i^*)^2
-\frac{\hat v^{1/2}_{T-1,i}}{2\alpha_{T-1}} ((\theta_{t,i}-\theta_i^*)^2)\\
\sum_{t=2}^{T-1}\left(\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} -\frac{\hat v^{1/2}_{t-1,i}}{2\alpha_{t-1}}\right) (\theta_{t,i}-\theta_i^*)^2\\
\frac{\hat v^{1/2}_{1,i}}{2\alpha_1} (\theta_{1,i}-\theta_i^*)^2
-\frac{\hat v^{1/2}_{T-1,i}}{2\alpha_{T-1}} ((\theta_{T,i}-\theta_i^*)^2)

only if
\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} \leq\frac{\hat v^{1/2}_{t-1,i}}{2\alpha_{t-1}} \qquad\qquad (\star)

for all $t$! ($(\star)$ was actually the biggest point of [2], which corrects the proof in [1] by forcing this equation to be true in AMSGrad.)

This simplifies

R_1(T) \leq \frac{1}{2\alpha_1} \|\hat V^{1/4}_1(\theta_{1}-\theta^*)\|_2^2-\frac{1}{2\alpha_{T-1}}\|\hat V^{1/4}_{T-1}(\theta_{T}-\theta^*)\|_2^2 \leq \frac{1}{2\alpha} \|\hat V^{1/4}_1(\theta_{1}-\theta^*)\|_2^2 \leq \frac{D^2}{2\alpha}\frac{ \sqrt{G}}{\sqrt{1-\beta_2}}

Term $R_2(T)$

We want to use one of our “small bounds” results here. Explicitly,

R_2(T) = \frac{\alpha}{2} \sum_{t=1}^T \frac{1-\beta_1^t}{1-\beta_{1,0}\lambda^t}\frac{\|\hat V_t^{-1/4}\hat m_t\|_2^2}{\sqrt{t}} \leq \frac{\alpha}{2} \sum_{t=1}^T \frac{1}{1-\beta_1\lambda}\frac{\|\hat V_t^{-1/4}\hat m_t\|_2^2}{\sqrt{t}} \leq \frac{\alpha A}{2(1-\beta_1\lambda)} \sqrt{\log(T)+1}\|g_{k,1:T}\|_1

Term $R_3(T)$

I’m actually a little confused as to why neither paper seemed to really address what was going on with $R_3(T)$. In particular, if we expand,

R_3(T) =\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}} \frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2}

\|\hat V_t^{-1/4}m_{t-1}\|_2^2\, -\, \|\hat V_t^{-1/4}m_{t} \|_2^2 &=& \|\hat V_t^{-1/4}(m_{t-1}-m_t) \|_2^2  \,-\, 2(\hat V_t^{-1/4}m_{t})^T(\hat V_t^{-1/4}(m_{t-1}-m_{t}))  \\
&=& (1-\beta_{1,t})^2 \|\hat V_t^{-1/4}(m_{t-1}-g_t) \|_2^2 \, -\, 2(1-\beta_{1,t})(\hat V_t^{-1/4}m_{t})^T(\hat V_t^{-1/4}(m_{t-1}-g_{t}))
and $(1-\beta_{1,t})^2\to 1$ as $t\to\infty$. So it’s not like we can bound $R_3$ using the same tricks from $R_2$. On the other hand, assuming that $|\hat V^{-1/4}m_{t-1}\|_2^2 $ is a bounded quantity, and $ \frac{\beta_{t,1}}{1-\beta_{t,1}} $ is a summable sequence, it’s more likely that they just saw $R_3(T) \leq \text{constant}$ and just didn’t bother with it. Well, for completeness, let’s bother with it. First,
\|\hat V_t^{-1/4}m_{t-1}\|_2^2 &\leq & \|V_t^{-1/4}m_{t-1}\|_2^2 \\
&\leq& \sum_{i=1}^d \frac{(\sum_{\tau=1}^t\beta_{1}^\tau g_{i,\tau})^2}{\sqrt{\sum_{\tau=1}^t \beta_2^\tau g^2_{i,\tau}}}  \\
&\overset{\oplus}{\leq}&  \sum_{\tau=1}^t \frac{(\beta_{1}^\tau )^2}{\sqrt{ \beta_2^\tau }} \|g_t\|_1\\
&\overset{\|g_t\|_1\leq dG_\infty}{\leq}& dG_\infty \sum_{\tau=1}^t \frac{(\beta_{1}^\tau )^2}{\sqrt{ \beta_2^\tau }}\\
&\leq& \frac{dG_\infty\sqrt{ \beta_2}}{\sqrt{ \beta_2}-\beta_1^2} =:B

so we’re safe!

Term $R_4(T)$

Using $\alpha_t=\alpha/\sqrt{t}$.

R_4(T) &=&\sum_{t=1}^T   \|\hat V_t^{1/4}(\theta_t-\theta^*)\|_2^2\frac{\beta_{t,1}}{2\alpha_t(1-\beta_{t,1})}\\
&=&\frac{D^2}{2\alpha}\sum_{t=1}^T   \|v_t^{1/2}\|_2\frac{\beta_{t,1}\sqrt{t}}{(1-\beta_{t,1})(1-\beta_2^t)}

Note that, element-wise, $$|v_{t,i}| = \sum_{\tau = 1}^t \beta_2^{t-\tau} g_{\tau,i}^2\leq G_\infty^2 \sum_{\tau’ = 0}^{t-1} \beta_2^{\tau’}  \leq  \frac{G_\infty^2}{1-\beta_2}$$
where we use a change of variables $\tau’ = t – \tau$ and upper bound with the infinite geometric series. This gets us to

R_4(T) = \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}}\sum_{t=1}^T\frac{\lambda^t\beta_{0,1}\sqrt{t}}{(1-\lambda^t\beta_{0,1})(1-\beta_2^t)}\leq \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}}\sum_{t=1}^T\lambda^t\sqrt{t}

and using some fairly liberal tricks, for $\lambda \in (0,1)$:
\sum_{t=1}^T\lambda^t\sqrt{t} \leq \sum_{t=1}^T\lambda^t t \leq \int_0^\infty \lambda^t t dt = \frac{1}{(1-\lambda)^2}

we get
R_4(T) = \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}}\sum_{t=1}^T\frac{\lambda^t\beta_{0,1}\sqrt{t}}{(1-\lambda^t\beta_{0,1})(1-\beta_2^t)}\leq \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}(1-\lambda)^2}

Sidebar: After 3 days of pouring through [1] and aided by [2], I really feel like [2] is an underrated work, that really makes the theoretical analysis of [1] clean and persevering. If you think about it, at this point, Adam isn’t just one person’s genius idea, it’s a method that a ton of people are using every day, in a slightly different form than the one presented. So it’s not unreasonable to expect that the convergence analysis should be a group effort, not just the work of a single team. In that spirit…

AMSGrad [2]

In addition to applying band-aids all over the proof of [1], [2] also ended up proposing a new (slightly modified) method, which fixes the issue raised in ($\star$). The fix is pretty simple: using the same updates for $m_t$ and $v_t$, we apply elementwise
\hat v_t = \max(\hat v_{t-1},v_t)

and update

\theta_{t+1} = \theta_t – \frac{\alpha_tm_t}{\sqrt{\hat v_t}}.

This ensures that $(\star)$ is always true, and thus the proof holds well.

Followup works

It is clear also that I am not the first one on the bandwagon, of skepticism, powered by the 1: huge assumptions used in this proof and 2: no clear acceleration in terms of $t$ shown. It turns out that this bandwagon is actually pretty chock-a-full of optimizers singing more or less the same tune. There are more followup works being written to acknowledge the weaknesses of Adam in a better way, one of them being

  • [3] Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search) by Sharan Vaswani, Issam Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien

which argues that adaptive methods can use constant step sizes in the interpolation regime (where $\theta = \theta^*$ implies $\nabla f_t(\theta) = 0$ for all $t$), which seems strong but actually quite frequent in overparametrized neural networks.

Another notable work is

  • [4] A Simple Convergence Proof of Adam and Adagrad, by
    Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier

which extend the proof of Adam’s convergence to nonconvex functions, and show a rate of $O(\log(T)/\sqrt{T})$, matching Adagrad, and improving upon SGD by a factor of $1-\beta_1$. Curious as I am about the guts of this proof, I’m afraid that’s going to have to wait for another day.

Finally, a fun reference. Francesco Orobana also has a nice rant about this too in his blog:

The points here are a bit different from the points I was trying to make, in literally pulling out the weaknesses of Adam; rather, he is the opposite side of the coin which is that the numerical success of Adam is still quite surprising. (I agree.)

Over all, I believe the subject of Adam remains interesting, important, and controversial. Probably better proofs can still be made.


Thanks Sharan Vaswani for helpful discussion in dissecting this.

Projected gradient descent, revisited

It turns out that some of my previous statements on PGD may not have been, well, correct. At first I thought it was just a few typos, but as I studied my notes further, I realized there may be some fundamental flaws. So this blog post revisits some ideas and hopefully irons out some past mistakes.

By the way, there aren’t any mistakes on past blog posts, but there is in my online SpeedOpto notes.


Projected gradient descent (PGD)

First, a review on how PGD works, and what is true about it. Consider the constrained optimization problem

$$ \min_{x\in \mathcal D} f(x)$$

where $f$ is $L$-smooth, and $\mathcal D$ is a closed convex compact set. (For now I will not assume $f$ is convex, since it doesn’t really matter as much. However, $\mathcal D$ convex is important.) Then PGD takes steps as

$$ x^{(k+1)} = \mathbf{proj}_{\mathcal D}(x^{(k)}-\alpha\nabla f(x^{(k)}))\tag{PGD}$$

where $\mathbf{proj}_{\mathcal D}$ is the Euclidean projection solution to the minimization problem

$$\mathbf{proj}_{\mathcal D}(x) = \arg\min_{y\in {\mathcal D}}\|y-x\|_2.$$

Previously, I made the statement that PGD cannot hurt GD, in the sense that

  • PGD approaches a stationary point at least as fast as GD, and
  • PGD has the same objective function decay rate as GD.

The first statement is still true. In particular, in the first case, we have that for any point in $\mathcal D$ (and in particular for $x^*\in \mathcal D$),

$$\|\mathbf{proj}_\mathcal D(x)-x^*\|_2\leq\|x-x^*\|_2$$

with equality only if $x\in \mathcal D$.

However, I am now realizing that we cannot so easily prove the second one. In fact, the usual statements for PGD and Mirror Descent are much weaker, namely, that for constant step size. $f(x)\to f(x^*)$ at a rate of $O(1/k) + O(\alpha \|\nabla f(x^*)\|_2^2)$ and taking $\alpha = 1/\sqrt{k}$ we may  get an aggregate $O(1/\sqrt{k})$ rate. (but it is not a just-in-time rate.)

Second however: if instead, we view the projection operation as a proximal operator of an indicator, we should get back our rate of $O(1/k)$. This I will explore in a follow up post.

What’s wrong with the descent lemma of PGD?

Let us simplify each step of PGD as

\begin{eqnarray*} y &=& x – \alpha \nabla f(x) \\ x^+ &=& \mathbf{proj}_{\mathcal D}(y).\end{eqnarray*}

Now, it’s tempting to try to write equations like this:

f(x^+)-f(x) &\leq& \nabla f(x)^T(x^+-x)  + \frac{L}{2}\|x^+-x\|_2^2\\
&\overset{+y-y}{=}& \nabla f(x)^T(x^+-y )+\nabla f(x)^T(y-x)  + \frac{L}{2}\|x^+-y\|_2^2 + L(x^+-y)^T(y-x) + \frac{L}{2}\|x-y\|_2^2\\
&\overset{y-x = -\alpha \nabla f(x)}{=}&-\alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2+ (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2

However, at this point we actually get stuck, because in general, it is very hard to bound this residual term

$$ r :=  (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2.$$

Previously, I tried to argue that this thing is negative ($r<0$); however, after drawing some pictures, it is clear that  $\nabla f(x)^T(x^+-y ) $ will more often be positive (these two terms are aligned), and for $\alpha < 1/L$, $r>0$. In fact, the best bound I could come up with is actually, for $\alpha \leq 1/L$,

$$r \leq \alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2\tag{descent}$$

which shows descent, but you can’t get a rate from it. (See appendix.) This makes sense, since $f(x^+)$ is not in general less than $f(y)$, so you wouldn’t expect the same technique for GD to give as aggressive a rate (in terms of $f$) as PGD. In fact, usually you would expect the opposite; $y$ steps in the maximally descending direction, but $x^+$ is a fix that brings back feasibility and can only hamper progress in objective value decay.

As a counterexample, first consider something like $f(x) = (x-1)^2$ over the constraint $x \geq 0$, and suppose that we already are at $x^{(k)} = 0$. Clearly, $\|\nabla f(x^{(k)})\|_2 = 2 > 0$, but hereonafter, all future iterates will have the same objective value. So we really can only show descent, not a descent amount that’s related to the gradient of $f$.



the proof (corrected)

After knocking my head around a bit, I have come to the conclusion that the best way to go around this is to actually use the (normal, accepted) way of the mirror descent proofs. First, we need two important inequalities in our arsenal: the 3 point equality

$$2(x-y)^T(x-z) = \|x-y\|_2^2 + \|z-x\|_x^2 – \|z-y\|_2^2\tag{3 pt =}$$

and the “projections bring me closer to the set” inequality (see past posts)

$$\|y-x^*\|_2 \geq \|x^+-x^*\|_2^2\tag{closer}$$

f(x)-f(x^*) &\leq& \nabla f(x)^T(x-x^*)  \\
&=& \frac{1}{\alpha}(x-y)^T(x-x^*)\\
&\overset{\text{(3 pt =)}}{=}&\frac{1}{2\alpha}(\underbrace{\|x-y\|_2^2}_{\alpha^2\|\nabla f(x)\|_2^2} + \|x^*-x\|_2^2 – \|x^*-y\|_2^2)\\
&\overset{\text{(closer)}}{\leq}&\frac{1}{2\alpha}(\alpha^2\|\nabla f(x)\|_2^2 + \|x^*-x\|_2^2 – \|x^*-x^+\|_2^2)

and after telescoping, we get

\frac{1}{k}\sum_{i=1}^k(f(x^{(i)}) – f(x^*))\leq \frac{\alpha}{2}\underbrace{\frac{1}{k}\sum_{i=1}^k\|\nabla f(x^{(i)})\|_2^2 }_{\overline{\nabla f(x^{(k)}}} + \frac{\|x^*-x^{(0)}\|_2^2 – \|x^*-x^{(k)}\|_2^2}{2k\alpha}.

This averaged gradient term (AG) converges to $\frac{\alpha}{2}\|\nabla f(x^*)\|_2^2$ which is not 0 in general. Recall also that we showed PGD descends, so that $f(x^{(i+1)}) \leq f(x^{(i)})$. Thus we can conclude

f(x^{(i)}) – f(x^*)\leq  \frac{\|x^*-x^{(0)}\|_2^2 – \|x^*-x^{(k)}\|_2^2}{2k\alpha} + \frac{\alpha}{2} \|\overline{\nabla f(x^{(k)}})\|_2^2\to O(1/k) + O(\alpha \|\nabla f(x^*)\|_2^2).



proof that

$$r=  (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2 \leq \alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2$$

Actually, this is just a consequence of the definition of projection, which says that, since $x\in \mathcal D$, then

\|x^+-y\|_2 \leq \|x-y\|_2 =\alpha \|\nabla f(x)\|_2.

Therefore, if $\alpha \leq \frac{1}{L}$, then using Cauchy Schwartz,

r&=&(1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2 \\
&\overset{\text{C-S}}{\leq} &(1-\alpha L)\|\nabla f(x)\|_2\|x^+-y\|_2 + \frac{L}{2}\|x^+-y\|_2^2\\
&\overset{\text{closer}}{\leq}&(1-\alpha L)\|\nabla f(x)\|_2\|x-y\|_2 + \frac{L}{2}\|x-y\|_2^2\\
&\overset{\|x-y\|_2=\alpha\|\nabla f(x)\|_2}{\leq}&\alpha (1-\frac{\alpha L}{2})\|\nabla f(x)\|_2^2


Video time! Discretization methods on quadratic and logistic regression

Hi all, this is another “tiny project” that I’ve been wanting to do for a while. Lately, I’ve been somewhat obsessed with method flows and discretization methods. While we can sit here and write integrals until our faces turn blue, I think it’s time to just simulate some stuff, and just see what happens. In this post I explore 4 problems:


  • Quadratic optimization, somewhat ill conditioned (duh.)
  • Normalized quadratic optimization, to look at nonsmoothness and bad curvature.
  • Logistic regression, well separated
  • Logistic regression, poorly separated

The methods include

  • GD = Euler’s discretization (explicit)
  • MDPT = Midpoint method (explicit)
  • RK4 = 4th order Runge Kutta method (explicit)
  • BE = Backward Euler (implicit)
  • BDF2 = 2nd order Backward Differentiation Formula (implicit)
  • TZ = Trapezoidal rule (implicit).

Here, an approximate implicit method = 1 gradient step, and a full implicit method = 100 gradient steps.




$$f(x) = \frac{1}{2}x^TQx$$



Normalized quadratic

$$f(x) = \sqrt{x^TQx}$$


Logistic regression

In the logistic regression models,

$$f(\theta) = \frac{1}{m}\log(1+\exp(-y_ix_i^T\theta)$$

where we generate data using the 2-blob model

$$x_i = y_i c + z_i $$

and $\|c\|_2$ is tuned to create separation. Included is not just the trajectory over the landscape, but also the classification barrier, so you can see whether the suboptimality actually has real effect on bad classification or not.


Well separated logistic regression





Poorly separated logistic regression






I had a few questions going into this tiny experiment, some of which were answered, some of which are still ephemeral.

  • Do implicit methods work? Answer:: Hard to say. They definitely do hug the flow better, and in the case of quadratic or normalized quadratic, they seem to help a lot with conditioning and being able to pick large step sizes. However, for logistic regression, they don’t really contribute well to margin orientation, which seems to be better affected by explicit noisy methods.
  • Are their approximate versions sufficient? Note that an approximate implicit method has the same per-iteration complexity as an explicit method. Answer: yes, the approximate methods are actually almost as good as the full ones, with regards to conditioning robustness and flow hugging!
  • Does it make sense to consider quadratic problems to represent machine learning tasks? Answer: This I kind of knew, but the answer is not really. Well, actually, it depends. For poorly separated blobs, a quadratic approximation is not bad, and seems to well-characterize what’s going on. But, on the other hand, for poorly separated blobs, the margin orientation is somewhat arbitrary and overfits on training data idiosyncrasies. On the other hand, when the blob is well separated, the step size choices, the flow characterization, all of that doesn’t seem to align well with margin orientation!

More studies need to be made in order to capture embedding learning as well, but anyway, it’s called a tiny project for a reason!

Finally, I leave you an image of our two favorite cheerleaders.


Some proofs of continuous time things

In trying to understand how continuous time interpretations can help optimization research, I find myself doing a lot of “recreating the wheel”, but in a nontrivial way; I am getting these proofs that I’m sure are standard knowledge, but it’s not like we learned them back when we were kids. So I’m going to try and go through some of them here.

Here are some super simple proofs to get us started.

Gradient descent

This one is actually super easy and classical, with one of the earliest citations [1]. Here, we’re looking at unconstrained methods

\min_{x} \quad f(x)\tag{(U)}

and the method is very simply

\mathbf x_{k+1} &=& \mathbf x_k   – \alpha \nabla f(\mathbf x_k)

with flow

\dot x(t) &=& -\nabla f(x(t)).

Super classical. Then the convergence proof of the flow is simply

\frac{d}{dt} f(x(t))= \nabla f(x)^T\dot x = -\|\nabla f(x)\|_2^2.

Right away, we get method descending, and (after showing that $\|\nabla f(x)\|_2$ also decays)  a rate

\|\nabla f(x(t))\|_2^2 \leq \frac{1}{t} \int_0^t\|\nabla f(x(\tau))\|_2^2d\tau \leq \frac{f(x(0))-f(x^*)}{t}

I’ve already discussed this one a bit in a previous post, so I’ll just stop here. Note that I did not say anything about convexity, compactness, etc.

Projected gradient descent

Now let’s make the problem slightly more complicated by doing the projected version.  Here, we want to solve

\min_{x\in\mathcal D} \quad f(x)\tag{C}

where $\mathcal D\subset \mathbb R^n$ is a convex, closed set.

We use the repeated iteration

\mathbf x_{k+1} &=& \mathrm{proj}_{ \alpha g}(\mathbf x_k   – \alpha \nabla f(\mathbf x_k))\tag{PGD}


\mathrm{proj}_{g}(z):= \arg\min_{x\in \mathcal D} \; \|x-z\|_2.

One way to reframe the method, then, is iteratively solving the following problem at each step

\mathbf x_{k+1} &=& \arg\min_x \frac{1}{2}\|x – \mathbf x_k   + \alpha \nabla f(\mathbf x_k)\|_2^2 + I_{\mathcal D}(x)

where $ I_{\mathcal D}(x)$ is the indicator for the constraint $x\in \mathcal D$.

From that, we can write the optimality condition

\mathbf x_{k} – \mathbf x_{k+1}  \in    \alpha \nabla f(\mathbf x_k) + \mathcal N_{\mathcal D}(\mathbf x_{k+1}) \tag{PGD-opt}

and $\mathcal N_{\mathcal D}(x)$ is the normal cone of $\mathcal D$ at $x$, described as

\mathcal N_{\mathcal D}(x) := \partial I_{\mathcal D}(x) = \{z : z^T(x-x’) \geq 0, \; \forall x’\in \mathcal D\}. \tag{NC-def}

And, again using approximations of $\dot x(t) \approx \frac{\mathbf x_{k+1} – \mathbf x_{k}}{\alpha}$ we get

-\dot x –  \nabla f(x) \in \mathcal N_{\mathcal D}(x). \tag{PGD-flow}

(I’m  a bit skeptical to do this, because then it seems this method is the same as the subgradient method, which I kind of assumed would have a worse rate. But, then again… maybe not in the flow? After all, what is the harm of nonsmoothness in the continuous time regime? Still, I’d feel more comfortable if some $\frac{d}{dt} \nabla f(x)$ term appeared somewhere. For now we’ll go with it.)

Now here comes a crucial step. In the usual proof of the projected gradient method, we needed to show that projections cannot increase your distance to $x^*$. I wrote this as a “funny little proof” in a previous post. But in the continuous version, what we really want to say is that projections can only slow you down in your desired direction–that is, the $\dot x$ from PGD can’t be faster than the $\dot x$ from good ole GD. Concretely, I want to say that the $\dot x$ defined in (PGD-flow) satisfies

\|\dot x\|_2 \leq \|\nabla f(x)\|_2. \tag{Proj-Contr-Cond.}


This is going to work as follows. We define $v(t) = -\dot x(t) – \nabla f(x(t))$. Then by (PGD-Flow), $v\in \mathcal N_\mathcal D(x)$, or $v^T(x-x’) \geq 0$, for all $x’\in \mathcal D$. I’m going to pick $x’ = x – c \dot x$, for some imperceptibly small $c > 0$. Note that since the trajectory is always feasible, such a $c>0$ should exist, if we assume we initialize the method somewhere in $\mathcal D$.  Then, I can say that $v^T\dot x \geq 0$.

Then, using Pythagorean theorem,

\|\nabla f(x)\|_2^2 = \|v+\dot x\|_2^2 = \|v\|_2^2 +\underbrace{ 2v^T\dot x}_{\geq 0}  + \|\dot x\|_2^2 \geq \|\dot x\|_2^2 \tag{Proj-Contr-Cont}

And we’re done! See below for the super-high-tech figure.

Projected gradient descent. Note the almost Pythagorean relationship between $\dot x$, $v$, and $\nabla f(x)$.

Ok, Now we can move onto the main event: convergence proof!

\frac{d}{dt} f(x(t)) = \nabla f(x)^T\dot x  \leq (\nabla f(x) + v)^T{\dot x} = -\|\dot x\|_2^2

and therefore

\frac{f(0)-f(x^*) }{t}\geq \frac{1}{t}\int_0^t \|\dot x(\tau)\|_2^2 d\tau

e.g. we have an $O(1/t)$ bound on the average speed of the method. Note that it makes perfect sense why we can’t seem to be able to bound the average gradient of the method, since in a constrained problem, $\nabla f(x(t))\not\to 0$.

The proof is more -or-less trivially extended to the proximal case, e.g. we solve

\min_{x} \quad f(x) + g(x)\tag{(U2)}

using the repeated iteration

\mathbf x_{k+1} = \mathrm{prox}_{ \alpha g}(\mathbf x_k   – \alpha \nabla f(\mathbf x_k)), \qquad
\mathrm{prox}_{g}(z):= \arg\min_x \; g(x) + \frac{1}{2}\|x-z\|_2^2.

Edit: It occurred to me that a very strange thing happens when $x(t)$ is on a corner of $\mathcal D$, which brings out a problem with nonsmoothness. In this case, $\dot x(t)$ is not really well defined: we have

\dot x_+(t) = \lim_{\Delta>0,\Delta \to 0} \frac{x(t+\Delta)-x(t)}{\Delta}, \qquad \dot x_-(t) = \lim_{\Delta>0,\Delta \to 0} \frac{x(t)-x(t-\Delta)}{\Delta}

A literal “corner” case, showing that it matters if the velocity is coming from the future or the past.

and they’re both different. Note that our relationship (Proj-Contr-Cont) only applies to $\dot x_+$, not $\dot x_-$ in this case. See image above.


[1] Polyak, Boris T. “Introduction to optimization. 1987.” Optimization Software, Inc, New York.


A fun convex set fact

This is actually not a new thing at all, but a thing I played with a couple years ago, and I ended up using in a numerical optimization lecture to try to impress some undergrads. They were not impressed. Since then I’ve tuned it a bit and just kind of take it for granted now, but now that I think about it, it is kind of nontrivial, and has some interesting significance. It’s come up again in the short course, so I figured I’d write it up here as well.

Statement: Suppose $\mathcal C$ is a convex set in $\mathbb R^n$, and $y = \mathbf{proj}_{\mathcal C}(z)$, for any point $z\in \mathbb R^n$. (Euclidean projection.) Then for any point $x\in \mathcal C$, $$\|y – x\|_2 \leq \|z-x\|_2.$$

Significance: This is part of a larger story that says “projecting on convex sets cannot hurt convex optimization”.  Basically, you should think of $x = x^*$ as an optimal part of some constrained optimization problem. This basically says that my contraction is a give-in, that every time I do descent-step-plus-projection, the projection step will always drive me closer to $x^*\in \mathcal C$. What I found interesting and nontrivial, though, is that this proof does not depend on any objective function $f$, so that while we really want to use this statement for $x = x^*$, it actually applies for any $x\in \mathcal C$.

Proof by picture:

Note that the assumption that $\mathcal C$ is a convex set is essential; counterexamples easily exist when the set is not convex.

Proof: First, we can assume that $z\not\in \mathcal C$, since if $z\in \mathcal C$, then $z = y$ and the statement is trivially true.

Second, we should assume that $x\neq y$. Since we know that the projection on a convex set is always unique, we can encode this further and say that there exists some $\epsilon > 0$ where $\|y-z\|_2 + \epsilon \leq \|x-z\|_2$.

Now, the fact that $y$ is the projected point means that it’s the solution to the convex optimization problem
$$ y = \arg\min_{y’\in \mathcal C} \; \|y’-z\|_2^2,$$ and thus we can apply the normal cone condition $-\nabla f(y)\in \mathcal N_{\mathcal C}(y)$:

(z-y)^T(x-y)\leq 0 \qquad\qquad \text{(normal cone condition, since $x^*\in \mathcal C$)}.


\|z-x\|_2^2 &=& \|(z-y) + (y-x)\|_2^2\\
&=& \underbrace{\|z-y\|_2^2}_{\geq 0} + \|y-x\|_2^2 + 2\overbrace{\underbrace{(z-y)^T}_{-\nabla f(z)^T}(y-x)}^{\geq 0 \text{ by N.C.}}\\
&\geq &

You may now be asking yourself, ok, but where did convexity fit in? Isn;t this only supposed to work when $\mathcal C$ is a convex set? (You can easily draw counterexamples that are nonconvex; see the star figure above.) Well, the normal cone of a nonconvex set $\mathcal S$ is the same as the normal cone of its convex hull $\mathbf{conv}(\mathcal S)$. When doing projections on $\mathcal S$ nonconvex, there is no way of ensuring that the  negative gradient will lie in the normal cone, which is a key property needed to make this proof work. But, if you are lucky and it does, then it probably means that you are taking steps that work equally well in the convex relaxation, and all will be fine! (Don’t count on it though!)



After posting findings

The extension of this statement to general Bregman divergences can be found in Nick’s course notes: https://www.cs.ubc.ca/~nickhar/F18-531/Notes20.pdf

Edit (July 2022): Thanks Mathurin Massias who  showed me how to drastically shorten the proof. (The previous version was like 5x longer.)

Small proof: $L$-smoothness definitions and equivalences


In a former version of this post, I conjectured that while $L$-smoothness and convexity were not equivalent in definition, they were tied in some of the “equivalent” (not equivalent) versions of their statements.

To clarify, my former statement:

“Convexity and smoothness are totally independent concepts and should not be confused with each other “

I now still believe is still true, and (about 80%) carries through to equivalent forms as well. Thus no groundbreaking optimization proofs were harming in the creation of this blog post.

But, there are still minor details to iron out.


The following are all considered “equivalent” definitions of $L$-smoothness. (tl;dr: they are not equivalent).

  • [D] $\|\nabla f(x)-\nabla f(y)\|_2\leq L\|x-y\|_2$
  • [Q] $f(x)-f(y)-\nabla f(y)^T(x-y) \leq \frac{L}{2}\|x-y\|_2^2$
  • [Q2] $ g(x) = \frac{L}{2}x^Tx-f(x)$ is convex
  • [M] $(\nabla f(x)-\nabla f(y))^T(x-y) \leq L\|x-y\|_2^2$
  • [H] Largest eigenvalue of $\nabla^2 f(x)$ is less than $L$

Now, some details:

  • [D] is in fact the definition of $L$-smoothness. It does not and cannot depend on convexity, since if $f$ is $L$-smooth, then so is $-f$.
  • [Q], [Q2], [M] are all equivalent. See also citation [1] below.
  • Under the assumption that the Hessian exists everywhere, [Q], [Q2], [M] $\iff$ [H].

New conjecture (probably exists somewhere, I’m too lazy to find it and rather prove it myself):

  1. [D] $\iff$ [Q], [Q2], [M] applies for $f$ and $-f$
  2. [D] + Hessian exists everywhere $\iff$ largest magnitude eigenvalue of $\nabla^2 f(x)$ is less than $L$
  3. [Q], [Q2], [M] alone are trivially true for concave functions.
  4. Relationship with convexity: [D] $\Rightarrow$ [Q], [Q2], [M].  If additionally the function is convex, then [Q], [Q2], [M] $\Rightarrow$ [D]. But it is a sufficient, not necessary condition (as evidenced by $f(x) = \sin(x)$ is $L$-smooth.)

Proof of statement 3: This is pretty simple, and we can just go after [Q2]: if $f$ is concave then $-f$ is convex, and thus $g$ is convex.

Proof of statement 2: This is more of a logic statement than anything else. It is still true that $L$-smoothness forces Hessian magnitude to be less than $L$, and once simple way to see this is to use the Taylor approximation trick: e.g. given that $\nabla^2 f(x)$ is continuous everywhere, for every $x$ and $y$, there exists some $w$ where $$\nabla f(x)-\nabla f(y) = \nabla^2 f(w)(x-y).$$ Therefore, $$\|\nabla f(x)-\nabla f(y)\|_2^2 = \|\nabla f(w)\|_2^2\|x-y\|_2^2,$$ which imposes that $\|\nabla f(w)\|_2 \leq L$ if [D] is to hold.

Proof of statement 4: First, without extra bells or whistles, Nesterov already showed us that [D] $\Rightarrow$ [Q], [Q2],[M]. To see this, we just prove for [Q]. First, the Bregman divergence can be written as an integral over inner products:
$$f(x)-f(y) = \int_0^1 \frac{\partial f (u(s))}{\partial s} ds = \int_0^1 \sum_{k=1}^n \frac{\partial f(u)}{\partial u_k} \frac{\partial u_k(s)}{\partial s} ds= \int_0^1 \nabla f(u(s))^T(x-y) ds$$
where $u(t) = (1-t) y + t x\in \mathbb R^n$. Therefore,
$$f(x)-f(y)  – \nabla f(y)^T(x-y)= \int_0^1 (\nabla f(u(s)) – \nabla f(y))^T(x-y) ds$$
and using Cauchy-Schwartz and imposing $L$-smoothness,

$$f(x)-f(y)  – \nabla f(y)^T(x-y) \overset{\mathrm{C-S}}{\leq}  \int_0^1 \|\nabla f(u(s)) – \nabla f(y)\|_2\|x-y\|_2 ds \overset{\mathrm{D}}{\leq} L \int_0^1 \|u-y\|_2\|x-y\|_2 ds.$$

Since $\|u-y\|_2 = s\|x-y\|_2$, the integral reduces to
$$L \int_0^1 \|u-y\|_2\|x-y\|_2 ds = L \|x-y\|_2^2\int_0^1 s ds = \frac{L}{2}\|x-y\|_2^2.$$

Now suppose [Q2] is true, and also $f$ is convex. And, for fun, let’s just assume that the Hessian exists everywhere. Then, since the eigenvalues of $\nabla^2 g(x)$ are $\geq 0$ and $f$ is convex, then $0 \preceq \nabla^2 f(x) \preceq LI$. Note that [D] only requires $-LI \preceq \nabla^2 f(x) \preceq LI$; hence the sufficient but not necessary bit.

Finally, Proof of statement 1: Suppose that [Q2] holds for $f$ and $-f$. Then the eigenvalues of $f$ must be bounded between $LI$ and $-LI$. This exactly gives the Hessian condition necessary for [D].

A could-be-fun-addendum: proving all of this without using the Hessian, aka for cases where the Hessian may not exist.

And, since it’s been a while, a cozy photo of a little friend


[1]: Blog of Xingyu Zhou, 2 pages in particular: https://xingyuzhou.org/blog/notes/Lipschitz-gradient and https://xingyuzhou.org/blog/notes/strong-convexity. By far the best set of reference notes I have seen on this topic.

Small proof: The gradient norms in gradient descent descends.

I’ve been racking my brain lately if there is an appropriate place to insert “small proofs”, e.g. stuff that isn’t trivial, I can’t find it anywhere, but isn’t cool enough to inspire a whole new paper. For now, I think I’ll put it here.


We investigate $\underset{x}{\min}\, f(x)$ using gradient descent

$$ x^{(k+1)} = x^{(k)} – \alpha \nabla f(x^{(k)}) \qquad\qquad\text{(GD)}$$

As I’m working to pull together notes for a short course, I realized all my proofs are “average case proofs”, e.g. averaged function suboptimality or averaged gradient norm squared converges like $O(1/k)$. But because GD is a “descent method”, the averaged function suboptimality is a last-iterate-proof in disguise, since

$$f(x^{(k+1)})\leq f(x^{(k)}) \Rightarrow f(x^{(k)}) \leq \frac{1}{k} \sum_{i=1}^k f(x^{(i)}) = O(1/k).$$

I wanted to do the same thing for the gradient norm case, but I could not find this proof around, even though intuitively it should be true. So I concocted one. Please let me know if this already exists, otherwise please cite this thing (url buried somewhere is probably sufficient) if you would like to steal it.

Statement: If $f$ is convex and $0<\alpha < 2/L$ then $f(x^{(k+1)})\leq f(x^{(k)}) $ for all $k$.

If this is true, then we can generalize average gradient results to last iterate results.

$$\|\nabla f(x^{(k)})\|_2^2 \leq \frac{1}{k} \sum_{i=1}^k \|f(x^{(i)})\|_2^2 = O(1/k).$$



$$ \| \nabla f(x^{(k+1)})\|_2^2 – \|\nabla f(x^{(k)})\|_2^2 = ( \nabla f(x^{(k+1)})-\nabla f(x^{(k)}))^T( \nabla f(x^{(k+1)})+\nabla f(x^{(k)})).$$

Using Taylor’s approximation theorem, there should exist some $z$ “between” $x^{(k)}$ and $x^{(k+1)}$  where $H = \nabla^2 f(z)$ satisfies

$$H(x^{(k+1)}-x^{(k)} ) = \nabla f(x^{(k+1)})-\nabla f(x^{(k)}).$$

Using this substitution,

\| \nabla f(x^{(k+1)})\|_2^2 – \|\nabla f(x^{(k)})\|_2^2 &=& (H(x^{(k+1)}-x^{(k)} ) )^T(H(x^{(k+1)}-x^{(k)} ) +2\nabla f(x^{(k)}))\\
&=& (H(x^{(k+1)}-x^{(k)} ) )^T(H(x^{(k+1)}-x^{(k)} ) -\frac{2}{\alpha}(x^{(k+1)}-x^{(k)})).

Now we make life easier and just write $s = x^{(k+1)}-x^{(k)}$ to get

\| \nabla f(x^{(k+1)})\|_2^2 – \|\nabla f(x^{(k)})\|_2^2 = s^TH^THs -\frac{2}{\alpha} s^THs.

Now we invoke convexity. Since $f$ is convex, then a Cholesky decomposition exists $H = LL^T$, and we encode $u = L^Ts$, to get

$$ \| \nabla f(x^{(k+1)})\|_2^2 – \|\nabla f(x^{(k)})\|_2^2 = u^THu -\frac{2}{\alpha} \|u\|_2^2.$$

which is $\leq 0$ if $0<\alpha < 2/L$.


It’s unclear to me if convexity is really needed, or if smoothness alone should be sufficient. It seems like there should be a way around requiring convexity, since the biggest property used was $Lu^Tu \geq u^THu$ which is true even if $f$ is not convex, but other little parts of the proof seems nontrivial. Maybe to be continued…

Edit: quick counterexample for nonconvex case: Suppose $f(x) = -x^2$ and start with $x^{(0)}\neq 0$. Then at every step, the gradient norm necessarily increases. We could say it still has an attainable min if we substitute this for, say, a sine function, but this violation still happens. Sad.

Edit: related works, generated through discussion:

Achievable online predictions

Hi all, it’s been a while! I took a step back from this blog to focus on teaching and research and such, but now that the last class is over for the fall, time to get back to this!

Today’s topic is now focused more on learning theory, which is an area I’m trying to learn more about. I’m going to cover Lemma 1 (and related topics) from [1], which was first introduced in [2]:

[1] A tutorial on Online Supervised Learning with Applications to Node Classification in Social Networks, by Alexander Rakhlin and Karthik Sridharan.

[2] Behavior of sequential predictors of binary sequences, by Thomas Cover.

Many thanks to Baojian for the topic introduction and Reza and Nikhil for discussions.

The scenario

The goal here is sequence prediction. Let’s make life easy for us so far, and say that we are dealing with binary sequences, e.g. $y_t \in \{1,-1\}$ for $t = 1,… $. The goal is to estimate $\hat y_t\in \{1,-1\}$ based on observations $y_1,…,y_{t-1}$, using a distribution $p_t = \text{pr}(\hat y_t = 1)$, and we will represent this with a surrogate variable $q_t\in[-1,1]$ where $q_t = \mathbb E[\hat y_t]$ under the distribution $p_t$.

Some examples

  • Suppose that the sequence is extremely repetitive. That is, either $y_t = (1,1,1,1,…)$ or $y_t = (-1,-1,-1,-1,…)$. Then by $t = 2$, we can make perfect predictions by just picking $q_t = y_1$.
  • Suppose that the sequence has repeating blocks of length $K$. That is, if $K = 3$, then a possible $y_t = (1,1,1,-1,-1,-1,-1,-1,-1,1,1,1,-1,-1,-1,…)$. Then we could at best do something with 60% accuracy, if we adopt the strategy of $\hat y_t = y_{t-1}.$ However, note that we could also construct an adversarial $y’_t = (1,-1,1,-1,1,-1,…)$ where this strategy would fail 100% of the time.

These examples show that in some real world prediction scenario, where there is some semblance of structure in the sequences, we might be able to come up with some reasonable algorithm. However, the existence of a not-terrible-algorithm depends on a structural prior, e.g. the set of allowable sequences must exist in some set smaller than $2^T$ where $T$ is the sequence length.

This concept also made more sense to me after I learned that these ideas were first explored by Thomas Cover in [2],  who primarily deals with codes over noisy channels. (The second example given is exactly a repetition code.) It makes sense to conclude that in the absence of any codes, no message can survive noise; on the other hand, enforcing some kind of coding technique limits the types of messages ultimately sent , and also limits the types of messages of which we need to “protect” from noise.

Multiclass extension

In order to borrow most of the tools from the proofs in [1], the rest of the post will discuss this scenario in the multiclass framework, e.g. instead of $y_t\in \{-1,1\}$, rather we consider $y_t\in \{1,2,…,K\}$ for some $K\geq 2$. (Note that $K = 2$ reduces to the binary case.) In other words, feasible strings with some kind of structural prior to keep in your head may look like:

  • (1,1,1,2,2,2,1,1,1,3,3,3,1,1,1,4,4,4,3,3,3,2,2,2)
  • (cat, cat, cat, cat, cat, cat, cat, cat, cat) (as opposed to dog)
  • (春,春,夏,秋,秋,秋,冬,春,夏,夏,夏,秋,秋,冬,冬,冬,冬)

You get the picture.

The no-free-lunch theorem

The first theorem we explore is an “online version” of the no free lunch theorem. In particular, it says that for any algorithm $q_t$, if we do not impose a prior, then for binary sequences of length $T$, it must be that

$$\mathbb E_y \mathbb E_{\hat y\sim p}\left[\frac{1}{T}\sum_{t=1}^T \mathbb 1_{\hat y_t\neq y_t}\right] = \frac{1}{2}$$

and more generally, for $K$-class sequences, where both $y_t$ and $\hat y_t$ can pick values in $\{1,2,…,K\}$,

$$\mathbb E_y \mathbb E_{\hat y\sim p}\left[\frac{1}{T}\sum_{t=1}^T \mathbb 1_{\hat y_t\neq y_t}\right] = 1-\frac{1}{K}.$$

The proof for this is quite straightforward. The key observation is to see that there is no such thing as a “strong strategy $p$”, and in fact, for any $p$,

$$\mathbb E_{y_T} \mathbb E_{\hat y_T\sim p}\left[ \mathbb 1_{\hat y_t\neq y_t}| y_1,…y_{T-1}\right] = 1-\frac{1}{K}.$$

The second example above shows this tragedy; no matter what you predict, there exists an equal number of scenarios where that prediction was wrong. Applying this across all time $t = 1,…,T$ generalizes the result.

Cover’s achievability lemma

But wait! Is this end of the story? Shall we despair and just call it quits? Nah, actually this is where Cover’s achievability theorem arises, and it’s quite cool. It basically says that while we cannot improve the universal accuracy of any algorithm, we can impose the way that it fails, by deciding which sequences we prioritize and which we throw in the gutter. (Think repetition codes = good, arbitrary sequences = bad.)

We call this goal function $\phi:\{-1,1\}^T\to [0,1] $, e.g. $\phi(y) = $ the probability that the sequence $y\in \{-1,1\}^T$ is guessed wrong. We say that $\phi$ is stable if for any two binary sequences $y_1\in \{-1,1\}^T$ and $y_2 \in \{-1,1\}^T$, if the Hamming distance between $y_1$ and $y_2$ is at most 1, then $|\phi(y_1)-\phi(y_2)|\leq \frac{1}{T}$. Written more explicitly,

$$ |\phi(…,1,…)-\phi(…,-1,…)|\leq \frac{1}{KT}. $$

Then Cover’s theorem says that this error distribution $\phi$ is achievable if and only if $\mathbb E_y[\phi(y)] \geq 1/2$ for binary sequences, and $\mathbb E_y[\phi(y)] \geq 1-1/K$ for multiclass sequences. Intuitively, this kind of makes sense, but there still is a somewhat finality in the statement, which says that while we can move things around, we really can never improve on our worst case scenario.

The rest of this blog post covers the proof of this achievability lemma, which is constructive in actually showing how one can define $q_t$ and $p_t$ given $\phi$.

Proof of Cover’s lemma

Water filling

Let us first consider the following multiclass min-max problem

$$\min_{q\in \Delta_K } \max_{y\in \{1,2,…,K\}} E_{\hat y\sim q} [ 1_{\hat y\neq y}] + R(y) \qquad (\star).$$

Now define a vector $r$ where $r_k = R(k)$, and assume that $r_i-r_j \leq 1/K$ for all $i,j$, e.g. stability condition. Define also $e_y$ the one-hot vector whose $y$th term is 1. Then $(\star)$ becomes

$$\min_{q\in \Delta_K } \max_{y\in  \{1,2,…,K\}} (\mathbf{1}-q+r)^Te_y \qquad (\star\star)$$

and therefore the optimal $y = \arg\max_k\,  1-q_k+r_k$. Therefore if $q$ wants to minimize things, it will try to force the maximum value of $\mathbf{1}-q+r$ to be as small as possible, e.g.

$$(\star) \iff (\star\star)\quad \iff\quad \min_{q\in \Delta_K } \max_k \; 1-q_k+r_k.$$

This is accomplished by “water filling”:

  1. Start with $q_k = 0$ for all $k$.
  2. At each iteration, find all the indices $k = \arg\max_k \; 1-q_k+r_k$, and then fill up $q_k$ until a (or several) new indices $i$ are in the max.
  3. Keep doing this until $\sum_k q_k = 1$, e.g. you run out of “water supply”.

Mathematically, this means that  $q_k =\max\{0, r_k-c\}$, where $c$ is chosen such that  $\sum_k q_k =1$. See below an example where $K = 9$.

Note also that the stability condition enforces that $\sum_{i=1}^K r_{\max}-r_i \leq 1$, so in fact the flag $q_k = 0$ will never happen, and we can just write $q_k = r_k-c$. Explicitly, this means

$$ c = \frac{ \sum_{i=1}^K r_i }{K}- \frac{1}{K}$$


$$(\star)\quad =\quad  1+c =  \frac{\sum_{i=1}^K r_i}{K}+\left(1-\frac{1}{K}\right) = \mathbb E_{y\in \{1,…,K\}} [R(y)] + \left(1-\frac{1}{K}\right).$$

Work backwards.

Let us now introduce what everything is.

Let’s first write out our regret over a predetermined sequence $y$, as

$$\text{Total regret}(y) =  E_{q_1}[\mathbb 1_{y_1\neq \hat y_1}] + E_{q_2|y_1}[\mathbb 1_{y_2\neq \hat y_2}] + E_{q_3|y_1,y_2}[\mathbb 1_{y_3\neq \hat y_3}] + \cdots + E_{q_T|y_1,…,y_{T-1}}[\mathbb 1_{y_T\neq \hat y_T}]$$

Again, we cannot reasonably bound the total regret over all choices of $y$, but we can (and want to) reasonably show that this total regret faction can be upper bounded by $T\cdot \phi(y)$, provided $\phi(y)$ has the correct stability condition, and has average $\mathbb E_y[\phi(y)] = 1-1/K$.

Suppose now that at iteration $t-1$, we had incurred $M_{t-1}(y_1,…,y_{t-1})$ mistakes. Then the expected number of total mistakes at step $t$ will be

$$M_t(y_1,…,y_t) = \mathbb E_{q_t}[\mathbb 1_{y_t\neq \hat y_t}] + M_{t-1}(y_1,…,y_{t-1}).$$

We impose that $M_T(y_1,…,y_T) = T\phi(y_1,…,y_T)$. Then, we work backwards, and see that by assigning $R_t(y_t) = -M_t(y_1,…,y_t)$, and using the previous waterfilling method (optimal $q$ for worst case $y$),

$$ \min_q\max_y \mathbb E_{q_t}[\mathbb 1_{y_t\neq \hat y_t}] – M_t(y_1,…,y_t)=\left(1-\frac{1}{K}\right) – \frac{\sum_{y_t} M_t(y_1,…,y_t)}{K}.$$

Therefore, using this strategy $q$,

$$-M_{t-1}(y_1,…,y_{t-1}) = \mathbb E_{q_t}[\mathbb 1_{y_t\neq \hat y_t}] – M_{t}(y_1,…,y_{t})
\leq \left(1-\frac{1}{K}\right) – \mathbb E_{y_t}[M_t(y_1,…,y_t)] .

By telescoping, we get to

$$-M_0 \leq T\left(1-\frac{1}{K}\right) –  \mathbb E_y[\phi(y_1,…,y_T)]$$

which, under the assumption that $\mathbb E_y[\phi(y_1,…,y_T)] \geq \left(1-\frac{1}{K}\right) $ implies that $M_0 \geq 0$, e.g. this solution is  feasible. Since it was optimized at each step, it is also optimal.

What’s the point?

First off, there is some beauty of course in knowing that while the no free lunch theorem says you can’t ever do better than random on average, you can decide how that randomness looks. In machine learning, this means you can force label patterns that don’t appear often to be always predicted wrong, but label patterns that appear all the time to be almost always correct. What’s more, the proof is constructive, and the sequence $q_t$ exactly gives the algorithm!

The other, more practical end of it is just that it makes proof writing for this type of problem easier in general. From now on, rather than providing algorithms and proofs, all I have to say is present a $\phi$ a function over all label sequences, and show that it satisfies

  • stability: $\phi(…,i,..)-\phi(…,j,…)<\frac{1}{TK}$
  • average error $\mathbb E_y[\phi(y)] \geq (1-1/K)$

and I know that an algorithm exists to achive this $\phi$! Then, taking expectations over a distribution more amenable to my data, say, $\mathcal D$, gives the new performance result!

  • average error over all possible labels:  $\mathbb E_{y\sim \mathrm{Unif}}[\phi(y)] = (1-1/K)$,
  • regret I actually care about $\mathbb E_{y\sim \mathcal D}[\phi(y)]) = $ possibly much lower!

In other words, while there is in general no free lunch, if I know where to look I may still find some cheaper lunches lying around! (at the cost of expensive lunches elsewhere…)