It turns out that some of my previous statements on PGD may not have been, well, correct. At first I thought it was just a few typos, but as I studied my notes further, I realized there may be some fundamental flaws. So this blog post revisits some ideas and hopefully irons out some past mistakes.

By the way, there aren’t any mistakes in past blog posts, but there are in my online SpeedOpto notes.

First, a review on how PGD works, and what is true about it. Consider the constrained optimization problem

$$\min_{x\in \mathcal D} f(x)$$

where $f$ is $L$-smooth and $\mathcal D$ is a compact convex set. (For now I will not assume $f$ is convex, since it doesn’t matter much yet. However, convexity of $\mathcal D$ is important.) Then PGD takes steps as

$$x^{(k+1)} = \mathbf{proj}_{\mathcal D}(x^{(k)}-\alpha\nabla f(x^{(k)}))\tag{PGD}$$

where $\mathbf{proj}_{\mathcal D}$ is the Euclidean projection, i.e. the solution to the minimization problem

$$\mathbf{proj}_{\mathcal D}(x) = \arg\min_{y\in {\mathcal D}}\|y-x\|_2.$$
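To fix notation, here is a minimal Python sketch of PGD (my own illustration, not from the notes; `pgd` and `proj_box` are hypothetical helper names, and I use a box for $\mathcal D$ so the projection is just a clip):

```python
import numpy as np

def proj_box(x, lo, hi):
    # Euclidean projection onto the box [lo, hi]^n (a simple compact convex set)
    return np.clip(x, lo, hi)

def pgd(grad_f, x0, alpha, proj, n_steps):
    # projected gradient descent: take a gradient step, then project back onto D
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = proj(x - alpha * grad_f(x))
    return x

# Toy problem: minimize f(x) = 0.5*||x - c||^2 over the box [0,1]^2 with
# c = (2, 0.5); the constrained minimizer is (1, 0.5).
c = np.array([2.0, 0.5])
x_opt = pgd(lambda x: x - c, np.zeros(2), alpha=0.5,
            proj=lambda x: proj_box(x, 0.0, 1.0), n_steps=100)
```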

Previously, I made the statement that PGD cannot hurt GD, in the sense that

• PGD approaches a stationary point at least as fast as GD, and
• PGD has the same objective function decay rate as GD.

The first statement is still true: for any point in $\mathcal D$ (and in particular for $x^*\in \mathcal D$),

$$\|\mathbf{proj}_\mathcal D(x)-x^*\|_2\leq\|x-x^*\|_2$$

with equality only if $x\in \mathcal D$.
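This contraction property is easy to sanity-check numerically. A randomized spot check (my own illustration, using a box as $\mathcal D$):

```python
import numpy as np

# Check that projection onto a convex set (here the box [0,1]^5) never
# increases the distance to a point x* inside the set.
rng = np.random.default_rng(0)
proj = lambda v: np.clip(v, 0.0, 1.0)

worst_gap = -np.inf   # max over trials of ||proj(x)-x*|| - ||x-x*||; should stay <= 0
for _ in range(1000):
    x = 3.0 * rng.normal(size=5)            # arbitrary point, often outside the box
    x_star = rng.uniform(0.0, 1.0, size=5)  # a point inside the set
    worst_gap = max(worst_gap,
                    np.linalg.norm(proj(x) - x_star) - np.linalg.norm(x - x_star))
```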

However, I am now realizing that we cannot so easily prove the second one. In fact, the usual statements for PGD and Mirror Descent are much weaker: namely, for a constant step size, $f(x^{(k)})\to f(x^*)$ at a rate of $O(1/k) + O(\alpha \|\nabla f(x^*)\|_2^2)$, and taking $\alpha = 1/\sqrt{k}$ we may get an aggregate $O(1/\sqrt{k})$ rate (but it is not an anytime rate, since the horizon $k$ must be fixed in advance).

Second, however: if we instead view the projection operation as the proximal operator of an indicator function, we should get back our rate of $O(1/k)$. I will explore this in a follow-up post.

What’s wrong with the descent lemma of PGD?

Let us simplify each step of PGD as

\begin{eqnarray*} y &=& x - \alpha \nabla f(x) \\ x^+ &=& \mathbf{proj}_{\mathcal D}(y).\end{eqnarray*}

Now, it’s tempting to try to write equations like this:

\begin{eqnarray*}
f(x^+)-f(x) &\leq& \nabla f(x)^T(x^+-x)  + \frac{L}{2}\|x^+-x\|_2^2\\
&\overset{+y-y}{=}& \nabla f(x)^T(x^+-y )+\nabla f(x)^T(y-x)  + \frac{L}{2}\|x^+-y\|_2^2 + L(x^+-y)^T(y-x) + \frac{L}{2}\|x-y\|_2^2\\
&\overset{y-x = -\alpha \nabla f(x)}{=}&-\alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2+ (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2
\end{eqnarray*}
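The rearrangement itself is just expanding $x^+ - x = (x^+ - y) + (y - x)$ and substituting $y - x = -\alpha\nabla f(x)$; a numerical spot check on random data (my own check, with arbitrary stand-ins for $x$, $\nabla f(x)$, and $x^+$):

```python
import numpy as np

# Spot check that the rearranged RHS equals the original descent-lemma RHS:
#   g^T(xp - x) + L/2 ||xp - x||^2
#     = -alpha(1 - L*alpha/2)||g||^2 + (1 - alpha*L) g^T(xp - y) + L/2 ||xp - y||^2
# where y = x - alpha*g.
rng = np.random.default_rng(3)
L, alpha = 2.0, 0.3
gap = 0.0
for _ in range(100):
    x, g, xp = rng.normal(size=(3, 4))   # x, grad f(x), and x^+
    y = x - alpha * g
    orig = np.dot(g, xp - x) + 0.5 * L * np.dot(xp - x, xp - x)
    rearr = (-alpha * (1 - L * alpha / 2) * np.dot(g, g)
             + (1 - alpha * L) * np.dot(g, xp - y)
             + 0.5 * L * np.dot(xp - y, xp - y))
    gap = max(gap, abs(orig - rearr))
```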

However, at this point we actually get stuck, because in general, it is very hard to bound this residual term

$$r := (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2.$$

Previously, I tried to argue that this residual is negative ($r<0$); however, after drawing some pictures, it is clear that $\nabla f(x)^T(x^+-y)$ will more often be positive (the two vectors tend to be aligned), and for $\alpha < 1/L$, $r>0$. In fact, the best bound I could come up with is, for $\alpha \leq 1/L$,

$$r \leq \alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2\tag{descent}$$

which shows descent, but you can’t get a rate from it. (See appendix.) This makes sense, since $f(x^+)$ is not in general less than $f(y)$, so you wouldn’t expect this technique to give PGD as aggressive a rate (in terms of $f$) as it gives GD. In fact, usually you would expect the opposite: $y$ steps in the direction of steepest descent, but $x^+$ is a correction that restores feasibility and can only hamper progress in objective value decay.

As an extreme example, consider $f(x) = (x+1)^2$ over the constraint $x \geq 0$, and suppose that we are already at $x^{(k)} = 0$, the constrained minimizer. Clearly, $\|\nabla f(x^{(k)})\|_2 = 2 > 0$, but from here on, all future iterates will have the same objective value. So we really can only show descent, not a descent amount that’s related to the gradient of $f$.
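This stalling is easy to reproduce in a few lines (my own check; the key is that the unconstrained minimizer of $f(x) = (x+1)^2$, namely $-1$, lies outside the feasible set $x \geq 0$):

```python
# f(x) = (x + 1)^2 over D = [0, inf): the unconstrained minimizer -1 is
# infeasible, so PGD parks at the constrained minimizer x = 0 even though
# ||grad f(0)|| = 2 > 0.
f = lambda x: (x + 1.0) ** 2
grad_f = lambda x: 2.0 * (x + 1.0)
proj = lambda x: max(x, 0.0)     # projection onto x >= 0

x, alpha = 0.0, 0.1
vals = []
for _ in range(5):
    x = proj(x - alpha * grad_f(x))   # step goes negative, projection snaps back
    vals.append(f(x))
```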

the proof (corrected)

After knocking my head around a bit, I have come to the conclusion that the best way around this is to use the standard, accepted route of the mirror descent proofs. First, we need two important inequalities in our arsenal: the three-point equality

$$2(x-y)^T(x-z) = \|x-y\|_2^2 + \|z-x\|_2^2 - \|z-y\|_2^2\tag{3 pt =}$$

and the “projections bring me closer to the set” inequality (see past posts)

$$\|y-x^*\|_2^2 \geq \|x^+-x^*\|_2^2.\tag{closer}$$
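The three-point equality is pure algebra (expand $\|z-y\|_2^2 = \|(z-x)+(x-y)\|_2^2$); here is a quick numerical spot check on random vectors (my own check):

```python
import numpy as np

# Spot check of the three-point equality
#   2 <x - y, x - z> = ||x - y||^2 + ||z - x||^2 - ||z - y||^2
# on random vectors.
rng = np.random.default_rng(1)
gap = 0.0
for _ in range(100):
    x, y, z = rng.normal(size=(3, 4))
    lhs = 2.0 * np.dot(x - y, x - z)
    rhs = np.dot(x - y, x - y) + np.dot(z - x, z - x) - np.dot(z - y, z - y)
    gap = max(gap, abs(lhs - rhs))
```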

Then, assuming now that $f$ is convex, each PGD step satisfies

\begin{eqnarray*}
f(x)-f(x^*) &\leq& \nabla f(x)^T(x-x^*)  \\
&=& \frac{1}{\alpha}(x-y)^T(x-x^*)\\
&\overset{\text{(3 pt =)}}{=}&\frac{1}{2\alpha}(\underbrace{\|x-y\|_2^2}_{\alpha^2\|\nabla f(x)\|_2^2} + \|x^*-x\|_2^2 - \|x^*-y\|_2^2)\\
&\overset{\text{(closer)}}{\leq}&\frac{1}{2\alpha}(\alpha^2\|\nabla f(x)\|_2^2 + \|x^*-x\|_2^2 - \|x^*-x^+\|_2^2)
\end{eqnarray*}

and after summing over the first $k$ steps, dividing by $k$, and telescoping the distance terms, we get

$$\frac{1}{k}\sum_{i=0}^{k-1}(f(x^{(i)}) - f(x^*))\leq \frac{\alpha}{2}\underbrace{\frac{1}{k}\sum_{i=0}^{k-1}\|\nabla f(x^{(i)})\|_2^2}_{=:\,\overline{g^2_k}} + \frac{\|x^*-x^{(0)}\|_2^2 - \|x^*-x^{(k)}\|_2^2}{2k\alpha}.$$

This averaged squared-gradient term, $\frac{1}{k}\sum_{i}\|\nabla f(x^{(i)})\|_2^2$, converges to $\|\nabla f(x^*)\|_2^2$, which is not $0$ in general. Recall also that we showed PGD descends, so that $f(x^{(i+1)}) \leq f(x^{(i)})$, and in particular the last iterate does at least as well as the running average. Thus we can conclude

$$f(x^{(k)}) - f(x^*)\leq \frac{\|x^*-x^{(0)}\|_2^2 - \|x^*-x^{(k)}\|_2^2}{2k\alpha} + \frac{\alpha}{2}\cdot\frac{1}{k}\sum_{i=0}^{k-1}\|\nabla f(x^{(i)})\|_2^2 = O(1/k) + O(\alpha \|\nabla f(x^*)\|_2^2).$$
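As a sanity check (my own toy instance, not from the notes), the bound holds along the whole trajectory for a small quadratic over a box; for such a well-behaved problem it holds with plenty of slack:

```python
import numpy as np

# Check the final bound on f(x) = 0.5*||x - c||^2 (so L = 1) over the box
# [0,1]^2, with c = (2, 0.5): the constrained minimizer is x* = (1, 0.5)
# and grad f(x*) = (-1, 0) is nonzero.
c = np.array([2.0, 0.5])
f = lambda v: 0.5 * np.dot(v - c, v - c)
grad = lambda v: v - c
proj = lambda v: np.clip(v, 0.0, 1.0)

alpha = 0.5                       # constant step size, alpha <= 1/L
x0 = np.zeros(2)
x_star = np.array([1.0, 0.5])

x, sq_grad_sum, ok = x0.copy(), 0.0, True
for k in range(1, 101):
    g = grad(x)                   # gradient at x^{(k-1)}
    sq_grad_sum += np.dot(g, g)
    x = proj(x - alpha * g)       # x^{(k)}
    bound = (np.dot(x_star - x0, x_star - x0)
             - np.dot(x_star - x, x_star - x)) / (2 * k * alpha) \
            + 0.5 * alpha * sq_grad_sum / k
    ok = ok and (f(x) - f(x_star) <= bound + 1e-12)
```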

Appendix

proof that

$$r= (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2 \leq \alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2$$

Actually, this is just a consequence of the definition of the projection: since $x^+$ minimizes the distance to $y$ over $\mathcal D$, and $x\in \mathcal D$,

$$\|x^+-y\|_2 \leq \|x-y\|_2 =\alpha \|\nabla f(x)\|_2.$$

Therefore, if $\alpha \leq \frac{1}{L}$ (so that $1-\alpha L \geq 0$), then using Cauchy-Schwarz,

\begin{eqnarray*}
r&=&(1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2 \\
&\overset{\text{C-S}}{\leq} &(1-\alpha L)\|\nabla f(x)\|_2\|x^+-y\|_2 + \frac{L}{2}\|x^+-y\|_2^2\\
&\overset{\|x^+-y\|_2\leq\|x-y\|_2}{\leq}&(1-\alpha L)\|\nabla f(x)\|_2\|x-y\|_2 + \frac{L}{2}\|x-y\|_2^2\\
&\overset{\|x-y\|_2=\alpha\|\nabla f(x)\|_2}{=}&\alpha (1-\frac{\alpha L}{2})\|\nabla f(x)\|_2^2
\end{eqnarray*}
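For what it’s worth, this bound also checks out numerically; a randomized check on my own toy quadratic-over-a-box setup:

```python
import numpy as np

# Randomized check of r <= alpha*(1 - L*alpha/2)*||grad f(x)||^2 for
# f(x) = 0.5*||x - c||^2 (so L = 1) over the box [0,1]^3, with x feasible.
rng = np.random.default_rng(2)
L, alpha = 1.0, 0.8               # any alpha <= 1/L works
proj = lambda v: np.clip(v, 0.0, 1.0)

ok = True
for _ in range(1000):
    c = 2.0 * rng.normal(size=3)
    x = rng.uniform(0.0, 1.0, size=3)   # a feasible point x in D
    g = x - c                            # grad f(x)
    y = x - alpha * g
    xp = proj(y)                         # x^+
    r = (1 - alpha * L) * np.dot(g, xp - y) + 0.5 * L * np.dot(xp - y, xp - y)
    ok = ok and (r <= alpha * (1 - L * alpha / 2) * np.dot(g, g) + 1e-12)
```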