I have finally reached the point in my life when I have the misfortune of meeting not one, but two nonsymmetric “contraction” matrices, and have had to try to understand why, even after all the hard work of proving that a matrix’s largest eigenvalue satisfies $|\lambda_{\max}|\leq \rho <1$, we still do not get a contraction rate of $\rho^k$. This journey includes not just a lot of bruises from banging one’s head against a wall, but also a (reunion) encounter with degenerate eigenvalues, algebraic vs geometric multiplicity, and the ever-present linear algebra embodiment of “speak softly and carry a big stick”, the Jordan canonical form.
(Apologies for rambling language. I haven’t had much sleep.)
Ok, so here is the problem. Let’s say I have a method, and through thick and thin I have finally proven that its error evolves like
$$
x^{(k)}-x^* = A_k (x^{(k-1)}-x^*) = (\prod_{i\leq k} A_i)(x^{(0)}-x^*).\tag{Method}
$$
Then, someone way smarter than me has also shown that the largest eigenvalue of $A_k$, in modulus, is at most $\rho <1$, for all $k$. Great, we say! We can conclude that $\|x^{(k)}-x^*\|_2 = O(\rho^k)$! And to rub it in, all our simulations actually show this rate! So we’re done, right?
Wait. No. $A_k$ is not symmetric. What we would like to say is
$$
\|x^{(k)}-x^*\|_2 = \Big\| \Big(\prod_{i\leq k} A_i\Big)(x^{(0)}-x^*)\Big\|_2\leq \rho^k\,\|x^{(0)}-x^*\|_2,\tag{LinRate}
$$
but while the equality is just (Method), the inequality would need something like $\max_{i\leq k} \|A_i\|_2\leq \rho < 1$, and a bound on the eigenvalues of a nonsymmetric matrix does not give a bound on its spectral norm. So, now what do we do?
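To see concretely why the eigenvalue bound alone is not enough, here is a minimal numpy sketch (a toy 2×2 upper-triangular matrix, not the $A_k$ of any particular method): the spectral radius is below 1, yet a single application of the matrix can still grow the error.

```python
# Toy example (values chosen arbitrarily): spectral radius < 1, but the 2-norm,
# i.e. the worst-case one-step expansion factor, is much larger than 1.
import numpy as np

A = np.array([[0.9, 5.0],
              [0.0, 0.9]])                         # nonsymmetric; both eigenvalues are 0.9

spectral_radius = max(abs(np.linalg.eigvals(A)))   # 0.9
two_norm = np.linalg.norm(A, 2)                    # about 5.2

x = np.array([0.0, 1.0])
print(spectral_radius, two_norm, np.linalg.norm(A @ x))  # one step grows the error norm
```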
First approach: Gelfand’s formula.
Through some googling, we usually can find this formula:
$$
|\lambda_{\max}| = \lim_{k\to\infty}\|A^k\|_2^{1/k}
$$
or, unrolled into something we can use at finite $k$: for every $k$,
$$
\|A^k\|_2^{1/k} \leq |\lambda_{\max}| + \epsilon_k, \qquad \epsilon_k\to 0.
$$
So, we can always just stop there and say that the rate (LinRate) is asymptotically true. This is somewhat unsatisfying, first because asymptotic rates are a bit wussy, and second because we just kind of used some old guy’s rule without really understanding why it is true.
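If you want to watch Gelfand’s formula do its thing, here is a quick numpy sketch (same flavor of toy upper-triangular matrix as before): $\|A^k\|_2^{1/k}$ does creep down toward the spectral radius, but only slowly.

```python
# Gelfand's formula, numerically: ||A^k||^(1/k) -> spectral radius, but only asymptotically.
import numpy as np

A = np.array([[0.9, 5.0],
              [0.0, 0.9]])
rho = max(abs(np.linalg.eigvals(A)))       # 0.9

Ak = np.eye(2)
for k in range(1, 201):
    Ak = Ak @ A
    if k in (1, 10, 50, 200):
        print(k, np.linalg.norm(Ak, 2) ** (1.0 / k), rho)   # still above 0.9 at k = 200
```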
Second approach: Complex diagonalization
It turns out that the issue we were facing above was not so much that $A_k$ is not symmetric, but that it is not necessarily diagonalizable. In cases where the $A_k$ are diagonalizable, we actually don’t have that much to worry about (beyond a condition-number constant, as we’ll see).
Take, for example, the contraction matrix associated with the Polyak Heavy Ball method:
$$
A_k = \begin{bmatrix} (1+\beta) I -\eta H_k & -\beta I \\ I & 0 \end{bmatrix}
$$
and Polyak’s method can indeed be written like (Method) where $H_k = \nabla^2 f(x^{(k)})$. When the problem is quadratic, $H_k$ is constant; otherwise it may change with $k$, but under the right smoothness and strong convexity conditions its eigenvalues will be bounded. Overall, with some smarts, we can show that $|\lambda_{\max}(A_k)| \leq \sqrt{\beta}$, and we really would like to say that this means $\|x^{(k)}-x^*\|_2 = O(\sqrt{\beta}^k)$. (This part of the analysis is well-known, was first shown by Polyak in [1], and I actually discussed it in an earlier blog post.)
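For intuition, here is a hedged numpy sketch of this system matrix for a fixed quadratic (so $H_k \equiv H$), with a toy Hessian and hand-picked $\eta,\beta$, not Polyak’s tuned values, chosen only so that each $\eta\sigma_i$ lies strictly inside $((1-\sqrt{\beta})^2,(1+\sqrt{\beta})^2)$: in this configuration the spectral radius of $A$ comes out exactly $\sqrt{\beta}$, its 2-norm is noticeably larger, and yet the iterates still decay at roughly the $\sqrt{\beta}^k$ rate (up to a constant).

```python
# Heavy-ball system matrix for f(x) = 0.5 x^T H x with a toy diagonal H.
# eta, beta are illustrative values (not Polyak's optimal tuning).
import numpy as np

H = np.diag([0.5, 1.0, 2.0])
eta, beta = 1.0, 0.81
d = H.shape[0]
I, Z = np.eye(d), np.zeros((d, d))

A = np.block([[(1 + beta) * I - eta * H, -beta * I],
              [I,                        Z        ]])

print(max(abs(np.linalg.eigvals(A))))   # spectral radius = sqrt(beta) = 0.9
print(np.linalg.norm(A, 2))             # 2-norm is well above 0.9

v = np.ones(2 * d)                      # stands in for x^{(0)} - x^*
for k in range(1, 101):
    v = A @ v
    if k in (1, 10, 100):
        print(k, np.linalg.norm(v), 0.9 ** k)   # decays like 0.9^k up to a constant
```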
Now suppose that $H_k$ is not constant, but still satisfies the spectral bounds in our assumptions. Here we follow the analysis in [2], which is cute, simple, and pretty powerful. Basically, we first observe that $H_k$ is symmetric and hence orthogonally diagonalizable, and thus we may write $A_k$ as:
$$
H_k = U_k\Sigma_kU_k^T, \qquad A_k = \begin{bmatrix} U_k & 0 \\ 0 & U_k \end{bmatrix}\underbrace{ \begin{bmatrix}(1+\beta)I-\eta \Sigma_k & -\beta I \\ I & 0 \end{bmatrix} }_{T_k}\begin{bmatrix} U^T_k & 0 \\ 0 & U^T_k \end{bmatrix}
$$
Furthermore, note that a permutation exists such that $\Pi T_k\Pi^T$ is block diagonal, with each block as the 2×2 matrix
$$
(\Pi T_k\Pi^T )_{2i-1:2i}= \begin{bmatrix}(1+\beta)-\eta (\Sigma_k)_i & -\beta \\ 1 & 0 \end{bmatrix}
$$
and importantly, this matrix is diagonalizable! (Whenever the discriminant $((1+\beta)-\eta (\Sigma_k)_i)^2-4\beta$ is negative, its two eigenvalues form a distinct complex-conjugate pair whose product is $\beta$, so each has modulus exactly $\sqrt{\beta}$.) So, using the right combination of permutation matrices, we will be able to diagonalize $T_k = Q_kZ_kQ_k^{-1}$ where $Z_k$ is diagonal and $\max_j |(Z_k)_{jj}|\leq\sqrt{\beta}$. Note also that both $Q_k$ and $Z_k$ will in general have complex entries (but that in itself is not problematic).
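A quick numpy check of one such 2×2 block (with $\sigma$ standing in for one eigenvalue of $H_k$, and the same illustrative $\eta,\beta$ as above): the eigenvalues come out as a complex-conjugate pair of modulus exactly $\sqrt{\beta}$, and the eigenvector matrix is well conditioned, so the block really is diagonalizable.

```python
# One 2x2 block of T_k, with sigma a stand-in for a single eigenvalue of H_k.
import numpy as np

eta, beta, sigma = 1.0, 0.81, 1.0
B = np.array([[(1 + beta) - eta * sigma, -beta],
              [1.0,                       0.0 ]])

evals, evecs = np.linalg.eig(B)
print(evals)                  # complex-conjugate pair
print(abs(evals))             # both have modulus sqrt(beta) = 0.9
print(np.linalg.cond(evecs))  # finite and modest: the (complex) eigenvectors form a basis
```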
Then, the rest is not too tricky. In the paper [2], they define $P_k = \begin{bmatrix} U_k & 0 \\ 0 & U_k \end{bmatrix}Q_k$ so that $A_k = P_kZ_kP_k^{-1}$, and introduce a dummy variable $u_k = P_k^{-1}(x^{(k)}-x^*)$ so that
$$
u_k = P_k^{-1}A_k P_k\, u_{k-1}= Z_k u_{k-1} = \Big(\prod_{i\leq k}Z_i\Big)u_0
$$
and since each $Z_i$ is diagonal with largest modulus entry $\leq \sqrt{\beta}$, it is not hard to say $\|u_k\|_2\leq \sqrt{\beta}^k\,\|u_0\|_2$. In the paper they actually use an additional step of the form $\|x^{(k)}-x^*\|_2\leq \kappa(P_k)\|u_k\|_2$, which adds a constant factor, but in practice $P_k$ looks like it ought to be well conditioned.
But, note that the hard part of this proof was to show that $T_k$ is diagonalizable (and therefore $A_k$ is diagonalizable) despite not being symmetric. Now, we have to ask ourselves, what happens if $A_k$ is not diagonalizable?
Third approach: try to remember some useful linear algebra facts.
Sometimes in the midst of some serious REM sleep, I’ll have recurring dreams of old math lectures. In one of them, a friendly graduate school professor chants, in a comforting mantra: “not all matrices are diagonalizable…” I then try to focus the memory a bit and skip ahead to the next part of the lecture, which says “…but the Jordan canonical form (JCF) always exists!”
To remember what this JCF thing is, let’s take as an example
$$
A = \begin{bmatrix} 1 & \alpha \\ 0 & 1 \end{bmatrix}.
$$
If we look at its characteristic polynomial, $\det(A-\lambda I) = (1-\lambda)^2$ vanishes only when $\lambda = 1$, so it has two identical roots (and, for $\alpha\neq 0$, only a one-dimensional eigenspace). But, we can also directly compute that
$$
A^k = \begin{bmatrix} 1 & k\alpha \\ 0 & 1 \end{bmatrix}.
$$
so clearly, $\|A^k\|$ is not bounded in the limit $k\to \infty$, even though the spectral radius of $A$ is exactly 1! Therefore, in this case, it would be absolutely incorrect to say (LinRate) with $\rho = 1$. However, if we just scale down the diagonal of $A$ ever so slightly, then
$$
A = \begin{bmatrix} \beta & \alpha \\ 0 & \beta \end{bmatrix}, \qquad
A^k = \beta^k \begin{bmatrix} 1 & \frac{\alpha}{\beta} \\ 0 & 1 \end{bmatrix}^k
= \beta^k \begin{bmatrix} 1 & \frac{k\alpha}{\beta} \\ 0 & 1 \end{bmatrix}
$$
and since $\beta^k$ shrinks geometrically while $k\alpha/\beta$ only grows linearly, this quantity does go to 0, asymptotically at essentially the rate $\beta^k$ (up to the factor of $k$), but it is not upper bounded by $\beta^k$ for every $k$.
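Here is a small numpy illustration of that “bump, then decay” behavior (same toy matrix, with $\alpha=5$, $\beta=0.9$ picked arbitrarily): $\|A^k\|_2$ first grows because of the $k\alpha/\beta$ term, and only then is dragged to zero by $\beta^k$.

```python
# Transient growth of a near-Jordan-block matrix: ||A^k|| rises before it decays.
import numpy as np

alpha, beta = 5.0, 0.9
A = np.array([[beta, alpha],
              [0.0,  beta]])

Ak = np.eye(2)
for k in range(1, 201):
    Ak = Ak @ A
    if k in (1, 5, 20, 50, 200):
        print(k, np.linalg.norm(Ak, 2), beta ** k)   # the norm exceeds beta^k for a long while
```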
Connection with the Jordan canonical form
The JCF decomposition of a matrix $A$ is
$$
A = UJU^{-1}
$$
where $U$ is invertible (but, unlike in the symmetric case, generally not orthogonal) and $J = \mathrm{diag}(J_1,J_2,\dots,J_D)$ contains diagonal blocks
$$
J_i = \begin{bmatrix} \lambda_i & 1 & 0 & \dots & 0 & 0 \\ 0 & \lambda_i & 1 & \dots &0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots
\\ 0 & 0 & 0 & \dots & \lambda_i&1
\\ 0 & 0 & 0 & \dots & 0 & \lambda_i\end{bmatrix}
$$
Each eigenvalue can have one or more Jordan blocks: the number of blocks for an eigenvalue is its geometric multiplicity, while the sum of their sizes is its algebraic multiplicity. A block of size 1 is the eigenvalue you already know and love; a block of size $>1$ means there are “hidden” directions (generalized eigenvectors) that the matrix mixes into but that we can’t reach using an eigenbasis. That is the case for the scary matrix we had found. Note that, taking any one of these diagonal blocks, of size $b_i$, we can get
\begin{eqnarray*}
J_i^k
= \lambda_i^k\begin{bmatrix} 1& \binom{k}{1}\lambda_i^{-1} & \binom{k}{2}\lambda_i^{-2} & \dots & \binom{k}{b_i-2}\lambda_i^{-(b_i-2)} & \binom{k}{b_i-1}\lambda_i^{-(b_i-1)} \\ 0 & 1& \binom{k}{1}\lambda_i^{-1} & \dots & \binom{k}{b_i-3}\lambda_i^{-(b_i-3)} & \binom{k}{b_i-2}\lambda_i^{-(b_i-2)}\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots
\\ 0 & 0 & 0 & \dots & 1& \binom{k}{1}\lambda_i^{-1}
\\ 0 & 0 & 0 & \dots & 0 &1\end{bmatrix}
\end{eqnarray*}
that is, entry $(r,r+m)$ of $J_i^k$ is $\binom{k}{m}\lambda_i^{k-m}$.
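As a sanity check, here is a tiny numpy verification of that entry formula (a single Jordan block with $\lambda = 0.9$ and size 4, values chosen arbitrarily).

```python
# Entry (r, r+m) of J^k equals C(k, m) * lambda^(k-m) for a single Jordan block.
import numpy as np
from math import comb

lam, b, k = 0.9, 4, 12
J = lam * np.eye(b) + np.diag(np.ones(b - 1), 1)    # one b x b Jordan block

Jk = np.linalg.matrix_power(J, k)
predicted = np.array([[comb(k, c - r) * lam ** (k - (c - r)) if c >= r else 0.0
                       for c in range(b)] for r in range(b)])
print(np.allclose(Jk, predicted))                   # True
```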
More precisely, every entry of $J_i^k$ has magnitude $\binom{k}{m}|\lambda_i|^{k-m}$ for some $0\leq m\leq b_i-1$, and $\binom{k}{m} \leq \frac{k^{m}}{m!}\leq \frac{k^{b_i-1}}{(b_i-1)!}$ once $k\geq b_i$. So, if $|\lambda_i|\leq \rho< 1$, every entry is bounded in magnitude by
$$
\frac{ k^{b_i-1}}{(b_i-1)!}\,\rho^{\,k-b_i+1},
$$
and therefore, using $\|M\|_2\leq \sqrt{\|M\|_1\|M\|_\infty}\leq b_i \max_{r,s}|M_{rs}|$ for a $b_i\times b_i$ matrix $M$,
$$
\|J_i^k\|_2 \leq \frac{ b_i\,k^{b_i-1}}{(b_i-1)!}\,\rho^{\,k-b_i+1},
$$
which, for a fixed matrix $A$ (so that $x^{(k)}-x^* = A^k(x^{(0)}-x^*)$), gives an overall nonasymptotic convergence rate of
$$
\|x^{(k)}-x^*\|_2\leq \kappa(U)\,\Big(\max_i \frac{b_i\, k^{b_i-1}}{(b_i-1)!}\,\rho^{\,k-b_i+1}\Big)\,\|x^{(0)}-x^*\|_2, \qquad \rho = \max_i|\lambda_i|,
$$
where the $b_i$ are the sizes of the Jordan blocks of $A$, and the only constant is the change-of-basis factor $\kappa(U) = \|U\|_2\|U^{-1}\|_2$. Indeed, this says $\|J_i^k\|_2\lesssim \rho^k$ up to a polynomial factor in $k$: the spectral radius does govern the rate, but only asymptotically, not as a clean $\rho^k$ bound from the very first iteration.
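To close the loop, here is a rough numerical check of this bound (toy setup: a single 4×4 Jordan block with $\lambda = 0.9$, conjugated by a random invertible $U$; all values arbitrary): the actual error stays below the JCF bound at every printed iteration.

```python
# Compare ||x^(k) - x*|| against kappa(U) * (b k^(b-1) / (b-1)!) * rho^(k-b+1) * ||x^(0) - x*||.
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
lam, b = 0.9, 4
J = lam * np.eye(b) + np.diag(np.ones(b - 1), 1)    # single Jordan block, rho = 0.9
U = rng.standard_normal((b, b))                     # generic, hence invertible
A = U @ J @ np.linalg.inv(U)                        # nondiagonalizable "contraction" matrix
kappa = np.linalg.cond(U)

x0 = rng.standard_normal(b)                         # stands in for x^{(0)} - x^*
err = x0.copy()
for k in range(1, 81):
    err = A @ err
    bound = kappa * (b * k ** (b - 1) / factorial(b - 1)) * lam ** (k - b + 1) * np.linalg.norm(x0)
    if k in (5, 20, 40, 80):
        print(k, np.linalg.norm(err), bound)        # error <= bound
```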
Conclusions and future thoughts.
So at this point, we’ve concluded that it is not a good idea to bound the action of “contraction” matrices (in quotes because they may not actually contract at every step) using only the modulus of the largest eigenvalue; however, we can also see, qualitatively, how the spectral radius does eventually give a linear rate. This analysis can be done using the ever-forgotten, always-powerful JCF, which points out very explicitly when eigenvalues are problematic. (The trouble is not asymmetry per se, but degeneracy.)
Overall (edited), we can actually come up with genuinely nonasymptotic rates for nondiagonalizable contraction matrices using the JCF. It feels a bit unsettling because the bound is not “fully linear” and depends critically on the $b_i$, the sizes of the Jordan blocks, which may be hard to ascertain in practice. But it is a nonasymptotic rate!
Well, hope y’all are enjoying your day!
Citations
[1] Polyak, Boris T. Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, 1987.
[2] Wang, Jun-Kun, Chi-Heng Lin, and Jacob D. Abernethy. “A modular analysis of provable acceleration via Polyak’s momentum: Training a wide ReLU network and a deep linear network.” In International Conference on Machine Learning, pp. 10816–10827. PMLR, 2021.
Many thanks to Sharan, Reza, and Baojian, whose discussions helped me understand this issue.