We need to talk about Adam

At some point in every optimist’s life, we are going to be asked the age-old question: “How does your method compare with Adam?” or, even more annoyingly, “Does this analysis help understand Adam?” I mean, Adam is a nice method that works very well for a very unique set of optimization problems, and not, like, the answer to all the problems in the universe (42). That being said, even if Adam (the method) isn’t exactly Adam (the first man), we should probably know a little bit about what it’s about.

In case you were living under a rock for the past 10 years, Adam refers to ADAptive Moment estimation, this magical method that came out in 2015 by Kingma and Ba, and is widely hailed as the one training method that actually seems to work well. It was the latest in adaptive step size methods, starting with AdaGrad (Duchi, Hazan, Singer 2011), but including AdaDelta (Zeiler 2012), RMSProp (Tieleman & Hinton, 2012), etc.

Adam (Kingma and Ba, 2015)

(typo above: should be $\alpha_t = \alpha/\sqrt{t}$, not constant $\alpha$)

Not Adam Or Adam

Anyway, so I finally decided it was time to bite the bullet, and learn about what this Adam thing is all about, follow the journey to a proper convergence proof, and give some comments as to how it may work in practice.  This post will largely be a literature survey about two parts of this story: the original paper, and the followup proof

[1] Kingma and Ba. Adam: A Method for Stochastic Optimization

[2] Reddi, Kale, Kumar. On the Convergence of Adam and Beyond.

Related prior works

As previously mentioned, Adam comes as a member of the “adaptive step size committee”, a string of methods that followed AdaGrad, the often hailed seminal work of Duchi et al who first proposed adaptively adjusting the step size to help deep network training deal with the vanishing gradient problem. Followup work, like RMSProp, and Adam itself, can be seen as “descendants” of this work.  While that paper in itself holds an important place in this journey, in the interest of time I will not mention it much more here.

What the papers do vs what people actually do

Before going on, I need to point out, indeed, highlight, three assumptions (1-3) that are critical in the proofs but clearly not used in practice. Additionally, we will mention a bug (4), which was later fixed in [2], and just for completeness a minor thing to remember (5).

  1. The proof assumes $f_t$, the online loss function, is convex in its parameters.
  2. The proof uses a decaying step size $\alpha_t = \alpha/\sqrt{t}$. In practice, we use a constant $\alpha$.
  3. The proof uses a decaying mixing parameter $\beta_{1,t} = \beta_{1,0}\lambda^{t-1}$ for some $0 < \lambda  < 1$.
  4. (Bug, fixed in later work) We must require $\sqrt{v_t}/\alpha_t$ is a decreasing sequence.
  5. (Minor) the proof assumes $\epsilon = 0$. (In practice, $\epsilon \approx 0$.)

Now, I should do a disclaimer and say that everyone and their sister, including me, often resorts to unreasonable assumptions the night before the deadline. It’s unsatisfying when theory doesn’t quite match practice, but the truth is doing convergence proofs is hard, especially for very general assumptions. However, what makes me feel that this is a bit less ok than the usual case is that Adam is often hailed as the method to use ([1] is cited over 100,000 times). Therefore, the fact that we as a community don’t seem as interested in closing this loop is a bit baffling to me. For this reason, at the end of this post, I will try to highlight a few papers that tried to “fix the proof” by justifying these assumptions. If I have missed out a work in this category that you know of, please do let me know and I will update it.

(By the way, I want to clarify, this is not a rant against the authors of [1]. I am not trying to take away the significance of their contribution. It’s more of a question of whether, as a community, we are satisfied with this story being “done”.)

Proof of Adam ([1], with help from [2])

For $g_t = \nabla_\theta(f_t(\theta_{t-1}))$, consider the method

m_t &=& \beta_{1,t}m_{t-1} + (1-\beta_{1,t})g_t, \quad \hat m_t = \frac{m_t}{1-\beta_1^t}\\
v_t &=& \beta_{2}v_{t-1} + (1-\beta_{2})g^2_t, \quad \hat v_t = \frac{v_t}{1-\beta_2^t}\\
\theta_t &=& \theta_{t-1}\, -\, \frac{\alpha_t }{\sqrt{\hat v_t}}\hat m_t

where we initialize $m_0 = v_0 = 1$ and $\beta_{1,0} = \beta_1$. Here, $m_t,v_t,\theta_t,g_t$ are all vectors in $\mathbb R^d$ and with an extreme abuse of notation, all multiplications and divisions between these vectors are taken element-wise. The scalars are $\alpha_t,\beta_{1,t},\beta_2$, and of course $t$.

Theorem: Assume that

  • the functions $f_t(\theta) = \mathcal L_\theta(x_t,y_t)$ are convex in $\theta\in \mathbb R^d$
  • the gradients of $f_t$ are bounded everywhere $\|\nabla f_t(\theta)\|_2 \leq G$, $\|f_t(\theta)\|_\infty \leq G_\infty$
  • the distance between any two $\theta_t$ generated by Adam is bounded $$\|\theta_n-\theta_m\|_2\leq D, \quad \|\theta_m-\theta_n\|_\infty \leq D_\infty, \text{  for any   } m,n \in \{1,…,T\}$$

Pick parameters such that $\beta_1,\beta_2 \in [0,1)$, $\beta_1^2/\sqrt{\beta_2} < 1$, $\alpha_t = \alpha/\sqrt{t}$, $\lambda\in (0,1)$, and define $\beta_{1,t}=\beta_1\lambda^{t-1}$. Then for all $T\geq 1$, the regret $$R(T) = \sum_{t=1}^Tf_t(\theta_t)-\min_\theta \sum_{t=1}^T f_t(\theta)$$ is upper bounded
\frac{D^2}{2\alpha}\frac{ \sqrt{G}}{\sqrt{1-\beta_2}}
+ \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} \frac{\alpha\sqrt{\log(T)+1} }{2(1-\beta_1\lambda)} \|g_{1:T}\|_1 \\
+ \frac{1}{1-\beta_1} \frac{dG_\infty\sqrt{ \beta_2}}{\sqrt{ \beta_2}-\beta_1^2} + \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}(1-\lambda)^2}
In terms of big-O, the first, third, and fourth term are constants, and overall we get an $R(T) = O(\sqrt{\log(T)}\|g_{1:T}\|_1$ term. Here I should note that in the two papers [1] and [2], actually this is a $\|g_{1:T}\|_2$  term, but in recreating the proof I was not able to recover that. That being said, these norms are exchangable, so we’ll just leave it as is. The  $\|g_{1:T}\|_2$ also appears in AdaGrad, and the argument is that if your gradients are sparse enough, this guy is also $O(\sqrt{T})$. Seems like a rough assumption, but let’s go with the flow for now.

Additionally, it should be noted that there does not appear to be a significant speedup over SGD. In [1], the argument is that Adam is $O(\log(d)/\sqrt{t})$, (which got boosted in terms of $t$ by [2]), but in no way can I recreate the proof so that the dependency on $d$ is less than linear. (I’d be happy to correct it if people feel differently.)

Either way, the fact that Adam does not claim to have a faster convergence rate than SGD in terms of $t$, is somewhat significant.


Now let’s do the proofs

Using the update rule $\theta_{t+1}=\theta_t-\frac{\alpha_t\hat m_t}{\sqrt{\hat v_t}}$ where $\epsilon = 0$
then we may write

(\theta_{t+1,i}-\theta_i^*)^2&=&(\theta_{t,i}-\theta_i^*)^2 – 2\alpha_t\frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}}}(\theta_{t,i}-\theta_i^*) + \alpha^2_t\frac{\hat m^2_{t,i}}{\hat v_{t,i}}\\
&=&(\theta_{t,i}-\theta_i^*)^2 – \frac{2(1-\beta_{1,t})\alpha_tg_{t,i}}{(1-\beta^t_1)\sqrt{\hat v_{t,i}}}(\theta_{t,i}-\theta_i^*)
– \frac{ 2\alpha_t\beta_{1,t} m_{t-1,i} }{\sqrt{\hat v_{t,i}}(1-\beta^t_1)}(\theta_{t,i}-\theta_i^*)
+ \alpha^2_t\frac{\hat m^2_{t,i}}{\hat v_{t,i}}.

Summing, and pulling out the term $g_t^T(\theta_t-\theta^*)$

g_t^T(\theta_t-\theta^*) &=& \sum_{i=1}^d\left(\left((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta_i^*)^2\right)\frac{\sqrt{\hat v_{t,i}}}{2\alpha_t }\frac{1-\beta_1^t}{1-\beta_{1,t}} \,+\, \frac{\alpha_t \hat m_{t,i}^2}{2\sqrt{\hat v_{t,i}}}\frac{1-\beta_1^t}{1-\beta_{1,t}} \,-\, \frac{\beta_1 m_{t-1,i}}{(1-\beta_{1,t})}(\theta_{t,i}-\theta_i^*)\right)

The last term can be bound using
|m_{t-1}^T(\theta_t-\theta^*)|\leq \frac{\|\alpha_t^{1/2}\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\alpha_t^{-1/2}\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2}
where $\hat V_t = \mathbf{diag}(\hat v_{t})$, giving
g_t^T(\theta_t-\theta^*) &\leq & \sum_{i=1}^d\left(\left((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta_i^*)^2\right)\frac{\sqrt{\hat v_{t,i}}}{2\alpha_t }\frac{1-\beta_1^t}{1-\beta_{t,1}}\,+\, \frac{\alpha_t \hat m_{t,i}^2}{2\sqrt{\hat v_{t,i}}} \frac{1-\beta_{1}^t}{1-\beta_{t,1}}\right) \\
&&+ \frac{\beta_{t,1}}{1-\beta_{t,1}} \left(\frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}\right)


From convexity , for $\theta^* = \min_\theta \sum_{t=1}^T f_t(\theta)$ we have
$$ f_t(\theta_t)-f_t(\theta^*)\leq g_t^T(\theta_t-\theta^*), $$
R(T) &\leq &   \sum_{t=1}^T\Bigg[\frac{1}{2\alpha_t} \frac{1-\beta_1^t}{1-\beta_{t,1}} (\|\hat V^{1/4}_t(\theta_{t}-\theta^*)\|_2^2 -\|\hat V^{1/4}_t(\theta_{t+1}-\theta^*)\|_2^2  ) \,+\, \frac{\alpha_t}{2} \frac{1-\beta_1^t}{1-\beta_{t,1}}\|\hat V_t^{-1/4}\hat m_t\|_2^2  \\&&+ \frac{\beta_{t,1}}{1-\beta_{t,1}} \left(\frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}\right)\Bigg]\\
&=& R_1(T)+R_2(T)+R_3(T) + R_4(T)


R_1(T) &=& \sum_{t=1}^T\frac{1}{2\alpha_t} (\|\hat V^{1/4}_t(\theta_{t}-\theta^*)\|_2^2-\|\hat V^{1/4}_t(\theta_{t+1}-\theta^*)\|_2^2 )\\
R_2(T) &=&\sum_{t=1}^T \frac{\alpha_t}{2}\frac{1-\beta_1^t}{1-\beta_{t,1}} \|\hat V_t^{-1/4}\hat m_t\|_2^2\\
R_3(T) &=&\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}} \frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} \\
R_4(T) &=&\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}}  \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}

Simple bounds

  • First, note that, element-wise, $$|v_{t,i}| = \sum_{\tau = 1}^t \beta_2^{t-\tau} g_{\tau,i}^2\leq G_\infty^2 \sum_{\tau’ = 0}^{t-1} \beta_2^{\tau’}  \leq  \frac{G_\infty^2}{1-\beta_2}$$
    where we use a change of variables $\tau’ = t – \tau$ and upper bound with the infinite geometric series.  So, $\|v_t\|_2 \leq d\|v_t\|_\infty \leq \frac{d G_\infty^2}{1-\beta_2}$.
  • Another important inequality. Consider two sequences $a_k\geq 0$ and $b_k\geq 0$. By Holder $p = 1$,
    \sum_k a_k = \sum_k \frac{a_k}{\sqrt{b_k}}\cdot \sqrt{b_k} \leq \left(\sum_j \frac{a_j}{\sqrt{b_j}}\right)\left(\max_l \sqrt{b_l}\right)\overset{\|\cdot\|_\infty \leq \|\cdot\|_2}{\leq} \left(\sum_j \frac{a_j}{\sqrt{b_j}}\right)\left(\sqrt{\sum_l b_l}\right)
    \frac{\sum_k a_k}{\sqrt{\sum_l b_l}}  \leq \sum_j \frac{a_j}{\sqrt{b_j}} \qquad\qquad(\oplus)
  • Lemma 2 in [1], fixed in Lemma 2 in [2]. We now need to put a bound on $\sum_t \frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}}$. In [1], they try to do a proof in which this term becomes $O(\sqrt{t})$; in [2], they tightened it to $O(\sqrt{\log(T)})$. In addition to being a better rate, [2] seemed to also have fewer bugs in the proof, so I think it makes sense to just present their version.
    \frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} = \sum_{i=1}^d\frac{\sqrt{1-\beta_2^t}(1-\beta_{1,0})^2}{(1-\beta_1^t)^2(1-\beta_2)}\frac{1}{\sqrt{t}}\frac{(\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j}g_{k,i})^2}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}
    Using Cauchy Schwartz,
    \sum_{k=1}^t(1-\beta_{1,0})\prod_{j=k}^t\beta_{1,j}g_{k,i}\leq \sqrt{\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j} \sum_{l=1}^t \prod_{j=l}^t\beta_{1,j}  g^2_{l,i}}
    so we may write
    \sum_{i=1}^d\frac{(\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j}g_{k,i})^2}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}} &=& \sum_{i=1}^d\frac{\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j} \sum_{l=1}^t \prod_{j=l}^t\beta_{1,j} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
    &=& \sum_{i=1}^d\frac{\sum_{k=1}^t\beta_1^{t-k} \sum_{l=1}^t \beta_1^{t-l} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
    &\overset{\sum_{k=1}^t\beta_1^{t-k}\leq \frac{1}{1-\beta_1}}{\leq}& \frac{1}{1-\beta_1}\sum_{i=1}^d\frac{\sum_{l=1}^t \beta_1^{t-l} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
    &\overset{(\oplus)}{\leq}& \frac{1}{1-\beta_1}\sum_{i=1}^d\sum_{k=1}^t\frac{ \beta_1^{t-k} g^2_{k,i}}{\sqrt{ \beta_2^{t-k}g_{k,i}^2}}\\
    &=& \frac{1}{1-\beta_1}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1\\
    Now, let’s sum over $t$.
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{\sqrt{1-\beta_2^t}(1-\beta_{1,0})^2}{(1-\beta_1^t)^2(1-\beta_2)(1-\beta_1)}\sum_{t=1}^T\frac{1}{\sqrt{t}}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1
    The next step is to use the assumption $\beta_1^2\leq \sqrt{\beta_2}$ to get
    \frac{\sqrt{1-\beta_2^t}}{(1-\beta_1^t)^2}(1-\beta_{1,0})^2 \overset{\beta_1^2\leq \sqrt{\beta_2}}{\leq} \frac{\sqrt{1-\beta_1^{4t}}}{(1-\beta_1^t)^2}(1-\beta_{1,0})^2 \overset{\beta_{1,0}\leq 1}{\leq}\frac{\sqrt{1-\beta_1^{4t}}}{(1-\beta_1^t)^2}\leq 1
    since the 1-D curve $g(x) = \frac{1-x^4}{(1-x)^4}$ is decreasing in the interval $0\leq x \leq 1$ and $g(0) = 1$. So, more simply,
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{t=1}^T\frac{1}{\sqrt{t}}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1
    \end{eqnarray*}We now switch the $t$ and the $k$ summations, and use $\tau = t-k$. Then
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{k=1}^T\sum_{t=k}^T\frac{1}{\sqrt{t}}\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1\\
    &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{k=1}^T\sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau+k}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\|g_k\|_1\\
    &\overset{\tau+k\geq\tau}{\leq}&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\sum_{k=1}^T\|g_k\|_1
    Now we can just work with this term
    \sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\overset{C-S}{\leq}\sqrt{\sum_{\tau=0}^{T}\frac{1}{\tau}}\sqrt{\sum_{\tau’=0}^T\frac{ \beta_1^{2\tau’} }{ \beta_2^{\tau’}}}\leq \sqrt{\log(T)+1}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }}
    which gives an overall rate of
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&   \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} \sqrt{\log(T)+1}\|g_{k,1:T}\|_1 \\&=& A \sqrt{\log(T)+1}\|g_{k,1:T}\|_1
    where we absorb the constant
    $$A= \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} .$$

Now we are ready to tackle the terms one by one. First,

Term $R_1(T)$

Looking at each element,

\sum_{t=1}^T\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} ((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta^*_i)^2 )&=&
\frac{\hat v^{1/2}_{1,i}}{2\alpha_1} (\theta_{1,i}-\theta_i^*)^2
-\frac{\hat v^{1/2}_{T-1,i}}{2\alpha_{T-1}} ((\theta_{t,i}-\theta_i^*)^2)\\
\sum_{t=2}^{T-1}\left(\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} -\frac{\hat v^{1/2}_{t-1,i}}{2\alpha_{t-1}}\right) (\theta_{t,i}-\theta_i^*)^2\\
\frac{\hat v^{1/2}_{1,i}}{2\alpha_1} (\theta_{1,i}-\theta_i^*)^2
-\frac{\hat v^{1/2}_{T-1,i}}{2\alpha_{T-1}} ((\theta_{T,i}-\theta_i^*)^2)

only if
\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} \leq\frac{\hat v^{1/2}_{t-1,i}}{2\alpha_{t-1}} \qquad\qquad (\star)

for all $t$! ($(\star)$ was actually the biggest point of [2], which corrects the proof in [1] by forcing this equation to be true in AMSGrad.)

This simplifies

R_1(T) \leq \frac{1}{2\alpha_1} \|\hat V^{1/4}_1(\theta_{1}-\theta^*)\|_2^2-\frac{1}{2\alpha_{T-1}}\|\hat V^{1/4}_{T-1}(\theta_{T}-\theta^*)\|_2^2 \leq \frac{1}{2\alpha} \|\hat V^{1/4}_1(\theta_{1}-\theta^*)\|_2^2 \leq \frac{D^2}{2\alpha}\frac{ \sqrt{G}}{\sqrt{1-\beta_2}}

Term $R_2(T)$

We want to use one of our “small bounds” results here. Explicitly,

R_2(T) = \frac{\alpha}{2} \sum_{t=1}^T \frac{1-\beta_1^t}{1-\beta_{1,0}\lambda^t}\frac{\|\hat V_t^{-1/4}\hat m_t\|_2^2}{\sqrt{t}} \leq \frac{\alpha}{2} \sum_{t=1}^T \frac{1}{1-\beta_1\lambda}\frac{\|\hat V_t^{-1/4}\hat m_t\|_2^2}{\sqrt{t}} \leq \frac{\alpha A}{2(1-\beta_1\lambda)} \sqrt{\log(T)+1}\|g_{k,1:T}\|_1

Term $R_3(T)$

I’m actually a little confused as to why neither paper seemed to really address what was going on with $R_3(T)$. In particular, if we expand,

R_3(T) =\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}} \frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2}

\|\hat V_t^{-1/4}m_{t-1}\|_2^2\, -\, \|\hat V_t^{-1/4}m_{t} \|_2^2 &=& \|\hat V_t^{-1/4}(m_{t-1}-m_t) \|_2^2  \,-\, 2(\hat V_t^{-1/4}m_{t})^T(\hat V_t^{-1/4}(m_{t-1}-m_{t}))  \\
&=& (1-\beta_{1,t})^2 \|\hat V_t^{-1/4}(m_{t-1}-g_t) \|_2^2 \, -\, 2(1-\beta_{1,t})(\hat V_t^{-1/4}m_{t})^T(\hat V_t^{-1/4}(m_{t-1}-g_{t}))
and $(1-\beta_{1,t})^2\to 1$ as $t\to\infty$. So it’s not like we can bound $R_3$ using the same tricks from $R_2$. On the other hand, assuming that $|\hat V^{-1/4}m_{t-1}\|_2^2 $ is a bounded quantity, and $ \frac{\beta_{t,1}}{1-\beta_{t,1}} $ is a summable sequence, it’s more likely that they just saw $R_3(T) \leq \text{constant}$ and just didn’t bother with it. Well, for completeness, let’s bother with it. First,
\|\hat V_t^{-1/4}m_{t-1}\|_2^2 &\leq & \|V_t^{-1/4}m_{t-1}\|_2^2 \\
&\leq& \sum_{i=1}^d \frac{(\sum_{\tau=1}^t\beta_{1}^\tau g_{i,\tau})^2}{\sqrt{\sum_{\tau=1}^t \beta_2^\tau g^2_{i,\tau}}}  \\
&\overset{\oplus}{\leq}&  \sum_{\tau=1}^t \frac{(\beta_{1}^\tau )^2}{\sqrt{ \beta_2^\tau }} \|g_t\|_1\\
&\overset{\|g_t\|_1\leq dG_\infty}{\leq}& dG_\infty \sum_{\tau=1}^t \frac{(\beta_{1}^\tau )^2}{\sqrt{ \beta_2^\tau }}\\
&\leq& \frac{dG_\infty\sqrt{ \beta_2}}{\sqrt{ \beta_2}-\beta_1^2} =:B

so we’re safe!

Term $R_4(T)$

Using $\alpha_t=\alpha/\sqrt{t}$.

R_4(T) &=&\sum_{t=1}^T   \|\hat V_t^{1/4}(\theta_t-\theta^*)\|_2^2\frac{\beta_{t,1}}{2\alpha_t(1-\beta_{t,1})}\\
&=&\frac{D^2}{2\alpha}\sum_{t=1}^T   \|v_t^{1/2}\|_2\frac{\beta_{t,1}\sqrt{t}}{(1-\beta_{t,1})(1-\beta_2^t)}

Note that, element-wise, $$|v_{t,i}| = \sum_{\tau = 1}^t \beta_2^{t-\tau} g_{\tau,i}^2\leq G_\infty^2 \sum_{\tau’ = 0}^{t-1} \beta_2^{\tau’}  \leq  \frac{G_\infty^2}{1-\beta_2}$$
where we use a change of variables $\tau’ = t – \tau$ and upper bound with the infinite geometric series. This gets us to

R_4(T) = \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}}\sum_{t=1}^T\frac{\lambda^t\beta_{0,1}\sqrt{t}}{(1-\lambda^t\beta_{0,1})(1-\beta_2^t)}\leq \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}}\sum_{t=1}^T\lambda^t\sqrt{t}

and using some fairly liberal tricks, for $\lambda \in (0,1)$:
\sum_{t=1}^T\lambda^t\sqrt{t} \leq \sum_{t=1}^T\lambda^t t \leq \int_0^\infty \lambda^t t dt = \frac{1}{(1-\lambda)^2}

we get
R_4(T) = \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}}\sum_{t=1}^T\frac{\lambda^t\beta_{0,1}\sqrt{t}}{(1-\lambda^t\beta_{0,1})(1-\beta_2^t)}\leq \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}(1-\lambda)^2}

Sidebar: After 3 days of pouring through [1] and aided by [2], I really feel like [2] is an underrated work, that really makes the theoretical analysis of [1] clean and persevering. If you think about it, at this point, Adam isn’t just one person’s genius idea, it’s a method that a ton of people are using every day, in a slightly different form than the one presented. So it’s not unreasonable to expect that the convergence analysis should be a group effort, not just the work of a single team. In that spirit…

AMSGrad [2]

In addition to applying band-aids all over the proof of [1], [2] also ended up proposing a new (slightly modified) method, which fixes the issue raised in ($\star$). The fix is pretty simple: using the same updates for $m_t$ and $v_t$, we apply elementwise
\hat v_t = \max(\hat v_{t-1},v_t)

and update

\theta_{t+1} = \theta_t – \frac{\alpha_tm_t}{\sqrt{\hat v_t}}.

This ensures that $(\star)$ is always true, and thus the proof holds well.

Followup works

It is clear also that I am not the first one on the bandwagon, of skepticism, powered by the 1: huge assumptions used in this proof and 2: no clear acceleration in terms of $t$ shown. It turns out that this bandwagon is actually pretty chock-a-full of optimizers singing more or less the same tune. There are more followup works being written to acknowledge the weaknesses of Adam in a better way, one of them being

  • [3] Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search) by Sharan Vaswani, Issam Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien

which argues that adaptive methods can use constant step sizes in the interpolation regime (where $\theta = \theta^*$ implies $\nabla f_t(\theta) = 0$ for all $t$), which seems strong but actually quite frequent in overparametrized neural networks.

Another notable work is

  • [4] A Simple Convergence Proof of Adam and Adagrad, by
    Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier

which extend the proof of Adam’s convergence to nonconvex functions, and show a rate of $O(\log(T)/\sqrt{T})$, matching Adagrad, and improving upon SGD by a factor of $1-\beta_1$. Curious as I am about the guts of this proof, I’m afraid that’s going to have to wait for another day.

Finally, a fun reference. Francesco Orobana also has a nice rant about this too in his blog:

The points here are a bit different from the points I was trying to make, in literally pulling out the weaknesses of Adam; rather, he is the opposite side of the coin which is that the numerical success of Adam is still quite surprising. (I agree.)

Over all, I believe the subject of Adam remains interesting, important, and controversial. Probably better proofs can still be made.


Thanks Sharan Vaswani for helpful discussion in dissecting this.

Leave a Reply