Post by Yifan Sun
In comparison to past posts, this post is really about some (relatively) recent work by a bunch of people ([1],[2],[3] to name a few). I therefore tried to spend a lot less time detailing how these things work, as these authors have posted their own excellent blog posts and presentations; however, some of it was necessary to make the point I wanted to make later. The goal here is to use the performance estimation problem (PEP) to understand the rates of three first-order methods: vanilla gradient descent (VGD), Polyak's heavy ball (PHB), and Nesterov's accelerated gradient (NAG).
tl;dr PEP is a construct to create automatic proof generators for methods by formulating the proof itself as the solution to a QCQP, whose semidefinite optimization relaxation turns out to be tight. It verifies the big-O complexity for the three methods (VGD, PHB, NAG), but can sometimes give tighter constants. I give here some performance examples on VGD, PHB, NAG, and show how the dual variables can be used to pull out the worst-case problem for VGD (verifying that the guarantee is tight and attainable).
The Procedure
Imagine the convex minimization problem $$\min_{x\in \mathbb R^d}\; f(x)$$ where we run our methods for $T$ iterations. For all three first order methods, we can summarize them as $$X = G A + x_1\mathbf 1^T$$ where $X = [x_1,x_2,…,x_{T}]\in \mathbb R^{d\times T}$, $G = \frac{1}{L}[\nabla f(x_1), \nabla f(x_2), …, \nabla f(x_T)]$, and $A$ describes the method iteration scheme. For example, taking $T = 5$, the VGD rule can be summarized simply as,
$$
A_{VGD} = \left[\begin{matrix} 0 & -1 & -1 & -1 & -1 \\ 0 & 0 & -1 & -1 & -1 \\0 & 0 & 0 & -1 & -1 \\0 & 0 & 0 & 0 & -1 \\0 & 0 & 0 & 0 & 0\end{matrix}\right]$$
and
$$
A_{PHB} = \left[\begin{matrix} 0 & -1 & -1 – \beta & -1-\beta-\beta^2 & -1-\beta-\beta^2-\beta^3 \\ 0 & 0 & -1 & -1 – \beta & -1- \beta – \beta^2 \\0 & 0 & 0 & -1 & -1 – \beta \\0 & 0 & 0 & 0 & -1 \\0 & 0 & 0 & 0 & 0 \end{matrix}\right].$$
It’s a bit trickier to do this for $A_{NAG}$, at least symbolically, but numerically, we can arrive at
$$
A_{NAG} = \left[\begin{matrix} 0 & -1 & -1 & -1 & -1\\
0 & 0 & -1.2818 & -1.4040 & -1.4690\\
0 & 0 & 0 & -1.4340 & -1.6645\\
0 & 0 & 0 & 0 & -1.5311\\
0 & 0 & 0 & 0 &0 \end{matrix}\right].
$$
(I’m sure there is also a nice closed-form expression for $A_{NAG}$, and I have a nagging feeling it isn’t pretty.) But the point is that these matrices exist, and they are strictly upper triangular, i.e. each iterate depends only on gradients at earlier points, so the system $X = GA + x_1\mathbf 1^T$ can be unrolled explicitly, column by column.
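Since each method is a simple recurrence, the matrices are easy to generate numerically. The sketch below is my own reconstruction (not code from the papers): it unrolls the VGD and PHB recurrences directly, and uses the standard FISTA-type $\theta$-sequence for NAG (with gradients taken at the extrapolated points), which reproduces the numerical $A_{NAG}$ entries above.

```python
import numpy as np

T, beta = 5, 0.3          # beta: PHB momentum (symbolic in the post; any value works here)
E = np.eye(T)

# VGD: x_{t+1} = x_t - (1/L) grad f(x_t)  =>  column t+1 = column t - e_t
A_vgd = np.zeros((T, T))
for t in range(T - 1):
    A_vgd[:, t + 1] = A_vgd[:, t] - E[t]

# PHB: x_{t+1} = x_t - (1/L) grad f(x_t) + beta (x_t - x_{t-1})
A_phb = np.zeros((T, T))
for t in range(T - 1):
    prev = A_phb[:, t - 1] if t > 0 else np.zeros(T)
    A_phb[:, t + 1] = A_phb[:, t] - E[t] + beta * (A_phb[:, t] - prev)

# NAG (FISTA-type momentum; gradients are evaluated at the extrapolated y-sequence):
#   x_{t+1} = y_t - (1/L) grad f(y_t),
#   y_{t+1} = x_{t+1} + ((th_t - 1)/th_{t+1}) (x_{t+1} - x_t),  th_1 = 1
A_nag = np.zeros((T, T))   # column t holds the gradient coefficients of y_t
x_cur = np.zeros(T)        # gradient coefficients of x_t
th = 1.0
for t in range(T - 1):
    x_next = A_nag[:, t] - E[t]                    # gradient step from y_t
    th_next = (1 + np.sqrt(1 + 4 * th ** 2)) / 2   # FISTA theta update
    A_nag[:, t + 1] = x_next + (th - 1) / th_next * (x_next - x_cur)
    x_cur, th = x_next, th_next

print(np.round(A_nag, 4))   # matches the numerical A_NAG displayed above
```

The columns of `A_nag` track the points where gradients are actually evaluated, which is why the first row is all $-1$'s: the first momentum coefficient $(\theta_1-1)/\theta_2$ vanishes.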
Now, our goal is to solve a primal (nonconvex, infinite-dimensional) optimization problem abstractly defined as
$$
\begin{array}{ll}
\underset{f,x_i,x^*}{\mathrm{maximize}} & f(x_T)\\
\mathrm{subject~to} & X = GA + x_1\mathbf 1^T\\
& f\in \mathcal C_L^{1,1}
\end{array}$$
where $\mathcal C_L^{1,1}$ is the class of $L$-smooth convex functions mapping $\mathbb R^d\to \mathbb R$. Unless there’s some secret fairy godmother out there granting magic wishes, this problem is not tractable.
So, what PEP proposes to do instead is to take this difficult constraint $f\in \mathcal C_L^{1,1}$ and replace it with sufficient conditions; namely, it imposes some known inequalities at only the “important” points. Specifically, in [1], the inequality of choice is $$ f(y) - f(z) \geq \nabla f(z)^T(y-z) + \frac{1}{2L}\|\nabla f(y)-\nabla f(z)\|_2^2. \qquad (\triangle)$$ Later, they show how to integrate a whole host of nice inequalities, which really generalizes the methodology, but since I am uncreative by nature, I decided to just use this one, especially since it summarizes $L$-smoothness and convexity in one nifty package. The inequality $(\triangle)$ is then imposed on all $(T+1)T$ ordered pairs $(y,z)\in \{x_1,…,x_T,x^*\}^2$ with $y\neq z$. Then, the abstract quantities $f(x)$ and $\nabla f(x)$ are replaced by variables $\delta_i = \tfrac{1}{L}(f(x_i)-f^*)$ and $g_i = \tfrac{1}{L}\nabla f(x_i)$ for $i = 1,…,T$. This problem, though still nonconvex, is now finite-dimensional. (The authors in [1],[2],[3] call points satisfying these constraints $\mathcal C_L^{1,1}$-interpolable.)
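Before moving on, it is worth sanity-checking $(\triangle)$ numerically. Here is a minimal sketch (my own illustration, not from the papers) on a random convex quadratic $f(x) = \tfrac12 x^TQx$, whose smoothness constant is $L = \lambda_{\max}(Q)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
B = rng.standard_normal((d, d))
Q = B @ B.T                              # random PSD Hessian => f is convex
L = float(np.linalg.eigvalsh(Q).max())   # smoothness constant of f

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

# check (triangle) on many random pairs (y, z)
for _ in range(1000):
    y, z = rng.standard_normal(d), rng.standard_normal(d)
    lhs = f(y) - f(z)
    rhs = grad(z) @ (y - z) + np.linalg.norm(grad(y) - grad(z)) ** 2 / (2 * L)
    assert lhs >= rhs - 1e-9
print("triangle inequality verified on 1000 random pairs")
```

For quadratics the inequality reduces to $Q \succeq Q^2/L$, which holds exactly when $L \geq \lambda_{\max}(Q)$, so the check would fail if you underestimated $L$.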
$$
\begin{array}{ll}
\underset{\delta_i,g_i,x_i,x^*}{\mathrm{maximize}} & \delta_T\\
\mathrm{subject~to} & X = GA + x_1\mathbf 1^T\\
& \delta_i-\delta_j \geq g_j^T(x_i-x_j) + \frac{1}{2L}\|g_i-g_j\|_2^2,\; i,j\in \{1,…,T,*\},\; i\neq j.
\end{array}
\qquad\text{(P)}$$
The dualization is then done by attaching a Lagrange multiplier to each family of constraints ($u_t$ and $v_t$ are the two constraints pairing $x_t$ with $x^*$, where $\delta_* = 0$ and $g_* = 0$):
$$\begin{array}{rrcl}
u_t: & \delta_t &\geq& \frac{1}{2L}\|g_t\|_2^2\\
v_t: &-\delta_t &\geq& g_t^T(x^*-x_t) + \frac{1}{2L}\|g_t\|_2^2\\
W_{i,j}:&\delta_i-\delta_j &\geq& g_j^T(x_i-x_j) + \frac{1}{2L}\|g_i-g_j\|_2^2,\;i\neq j.
\end{array}$$
I’ll leave it up to the papers to give the full derivation, but there are two very cute steps that I think are nontrivial.
- After minimizing out the $\delta_i$ terms, the Lagrangian takes the form
$$
\mathcal L=-\mathbf{tr}(GMG^T)+v^TG^T(x_1-x^*)
$$
which implies that the optimal (read: worst-case) $G$ is rank-1, with columns proportional to $x_1-x^*$; i.e. $G = (x_1-x^*)\xi^T$.
- The problem is then transformed to an SDP with the nifty observation that
$$
\min_{\xi}\; \xi^TM\xi - v^T\xi \geq -\eta \quad \iff \quad \left[\begin{matrix}M & -v/2\\-v^T/2 & \eta \end{matrix}\right]\text{ is PSD}.
$$
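To see why, expand the quadratic form of the block matrix against a vector $(\xi,1)$:
$$
\left[\begin{matrix}\xi\\1\end{matrix}\right]^T\left[\begin{matrix}M & -v/2\\-v^T/2 & \eta \end{matrix}\right]\left[\begin{matrix}\xi\\1\end{matrix}\right] = \xi^TM\xi - v^T\xi + \eta \geq 0 \quad \forall\,\xi,
$$
and by homogeneity, nonnegativity on these vectors (together with $M\succeq 0$, which is automatic whenever the minimum is finite) is equivalent to PSD-ness of the whole block matrix.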
This finally gives the (dual) semidefinite optimization problem
$$
\begin{array}{ll}
\underset{{\eta,u,v,W}}{\mathrm{minimize}} & \eta \cdot \frac{\|x_1-x^*\|_2^2}{L}\\
\mathrm{subject~to} & e_T = v-u+\sum_j\sum_{k\neq j} (W_{k,j}-W_{j,k})e_j\\
&\left[\begin{matrix}M & -v/2\\-v^T/2 & \eta \end{matrix}\right]\text{ is PSD}\\
& u\geq 0, v\geq 0, W\geq 0
\end{array}
\qquad\text{(D)}$$
where $M = \frac{1}{2}(M_0+M_0^T)$ for
$$
M_0 = \sum_{t=1}^T \left(\frac{u_t+v_t}{2}\, e_te_t^T - v_tAe_te_t^T\right) + \sum_{i\neq j} W_{ij}\left( A (e_i-e_j)e_j^T + \frac{1}{2}(e_ie_i^T+e_je_j^T-e_ie_j^T-e_je_i^T)\right).
$$
Here $e_k = [\underbrace{0,…,0}_{k-1\text{ $0$s}},1,0,…,0]^T$ is the $k$th standard basis vector.
The Worst-Case Problem
I’ll give the procedure for finding the worst case problem for VGD, which at the very least gives some insight as to how simple these “worst” problems can be. I am still in the process of figuring out PHB and NAG, so I will leave those questions hanging.
- First, note that dimensionality never came into play; none of the dual variables depend on $d$. Therefore, we can just assume $d =1$, i.e. all the variables $x_i$ are scalars.
- Second, recall that the optimal $G$ is rank 1. That means that in the worst case problem, we can assume that all the gradients $\nabla f(x_i) = c_i(x_1-x^*)$, where $c_i$ is just a scalar quantity! This further reduces the search space.
- Finally (and this comes from the optimization problem itself): in the VGD case, the dual variables can be observed to be optimized at $u = 0$, $v_i = 0$ for all $i < T$, and $v_T = 1$. The $W$ terms are trickier; numerically, I did not observe much sparsity in the off-diagonal entries of $W$ ($W_{ii} = 0$ because no constraints are encoded for $i =j$). However, for VGD, if you explicitly set $W_{ij} = 0$ for all $|i-j|\geq 2$, you will not see any difference in the objective value achieved! Therefore, you know that a proof over a worst-case problem can be built involving only the inequalities
\begin{eqnarray*}
f(x_t)-f(x_{t+1}) &\geq& \nabla f(x_{t+1})^T(x_t-x_{t+1}) + \frac{1}{2L}\|\nabla f(x_t)-\nabla f(x_{t+1})\|_2^2, \; \quad t = 1,…,T-1\\
f(x_{t+1})-f(x_{t}) &\geq& \nabla f(x_{t})^T(x_{t+1}-x_{t}) + \frac{1}{2L}\|\nabla f(x_t)-\nabla f(x_{t+1})\|_2^2, \; \quad t = 1,…,T-1\\
f^*-f(x_T) &\geq& \nabla f(x_T)^T(x^*-x_T) + \frac{1}{2L}\|\nabla f(x_T)\|_2^2
\end{eqnarray*}
and moreover, because of complementary slackness, these inequalities can all be set to equality in the worst case problem!
Putting these all together (taking $d=1$, $x^* = 0$, normalizing $x_1-x^* = 1$ so that $\nabla f(x_t) = c_t$, and using the VGD identity $x_t - x_{t+1} = \nabla f(x_t)/L = c_t/L$), I have the following two equalities for each consecutive pair:
\begin{eqnarray*}
f(x_{t+1})-f(x_t) + \frac{1}{L}c_t^2 &=& \frac{1}{2L}(c_t^2-2c_tc_{t+1}+c_{t+1}^2)\\
f(x_t)-f(x_{t+1}) - \frac{1}{L} c_tc_{t+1} &=& \frac{1}{2L}(c_t^2-2c_tc_{t+1}+c_{t+1}^2)
\end{eqnarray*}
which, adding the two and simplifying, implies that
$$c_{t+1}^2-c_tc_{t+1}=0 \iff c_{t+1} = 0 \text{ or } c_{t+1}=c_t.$$
To avoid trivial functions, this tells me that, for all my discretization points (not including $x^*$), $$\nabla f(x_t) = c_t(x_1-x^*) = c(x_1-x^*),$$ i.e. the gradient is the same at every iterate. Without loss of generality I can pick $x^* = 0$, and it is clear that I have a function that, at all the points sampled, agrees with a linear function of slope $c$.
Note that simply ending the story with $f(x) = cx$ is not sufficient: a linear function has no minimizer, so $x^*$ would not exist, and the function must bend down to a minimum at $x^* = 0$ without violating $L$-smoothness. The rest of this part took a bit of trial and error, but if you pick this Huber function
$$ f(x) = \begin{cases} \frac{x^2}{2} & x\leq \alpha \\ \alpha x – \frac{\alpha^2}{2} & x \geq \alpha\end{cases}$$
and in particular set $\alpha = x_T/T$, you can control the slope and force $L=1$ at the same time, achieving the exact worst case bound predicted by PEP. Therefore, this bound is tight.
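To close the loop numerically, here is a small sketch (my own check, normalizing $x_1 = 1$, $x^* = 0$ so that $\alpha = \tfrac{1}{2T-1}$): it runs VGD on the Huber function above, verifies that every inequality used in the proof is tight along the trajectory, and checks that the final gap equals $\tfrac{L\|x_1-x^*\|_2^2}{2(2T-1)}$, which agrees with the $\tfrac{L\|x_0-x^*\|^2}{4N+2}$ rate of [1] with $N = T-1$ gradient steps.

```python
T, L = 5, 1.0
alpha = 1.0 / (2 * T - 1)        # the kink location; equals x_T / T for this trajectory

f = lambda x: 0.5 * x * x if x <= alpha else alpha * x - 0.5 * alpha * alpha
g = lambda x: x if x <= alpha else alpha      # gradient of f; f is convex and 1-smooth

xs = [1.0]                                    # x_1 = 1, minimizer x* = 0
for _ in range(T - 1):
    xs.append(xs[-1] - g(xs[-1]) / L)         # vanilla gradient descent

def slack(y, z):                              # slack of (triangle); always >= 0
    return f(y) - f(z) - g(z) * (y - z) - (g(y) - g(z)) ** 2 / (2 * L)

tight = [slack(xs[t], xs[t + 1]) for t in range(T - 1)]   # consecutive pairs...
tight += [slack(xs[t + 1], xs[t]) for t in range(T - 1)]  # ...in both directions
tight += [slack(0.0, xs[-1])]                             # the (x*, x_T) pair
assert max(abs(s) for s in tight) < 1e-12                 # all tight (equalities)

assert abs(f(xs[-1]) - L / (2 * (2 * T - 1))) < 1e-12     # worst-case gap: 1/18 at T = 5
```

All iterates stay in the linear region (the smallest is $x_T = T/(2T-1) > \alpha$), so the gradient is the constant $\alpha$ along the whole trajectory, exactly the "constant slope $c$" picture derived above.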
With that, I leave you with a cute photo of my new houseguest. #ratsarethebest
Thanks to Adrien, who introduced me to this tool and kept nudging me to look at it more closely.
References
[1] Drori, Y, and Teboulle, M. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming (2014).
[2] Taylor, A.B., Hendrickx, J.M. and Glineur, F. Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In Annual Conference on Decision and Control (CDC) (2017).
[3] Drori, Y. and Taylor, A.B. Efficient first-order methods for convex minimization: a constructive approach. Mathematical Programming (2020).
[4] Su, W., Boyd, S. and Candes, E.J. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. JMLR (2016).