Tag: convergence rates

Tangoing with nonsymmetric contraction matrices: Understanding Gelfand’s formula and its implications on linear convergence rates
I have finally reached the point in my life where I have had the misfortune of meeting not one, but two nonsymmetric "contraction" matrices, and have had to try to understand why, even after all the hard work of proving that a matrix's largest eigenvalue satisfies $\lambda_{\max}\leq \rho <1$, this still does not mean that … Continue reading Tangoing with nonsymmetric contraction matrices: Understanding Gelfand’s formula and its implications on linear convergence rates
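As a quick numerical illustration of the phenomenon that post is about (my own toy sketch, not code from the post): a matrix can have spectral radius $\rho(A) < 1$ while its early powers $\|A^k\|$ grow, and Gelfand's formula $\rho(A) = \lim_k \|A^k\|^{1/k}$ only bites asymptotically.

```python
import numpy as np

# A nonsymmetric "contraction": spectral radius 0.9 < 1, but the
# 2-norms of early powers A^k exceed 1 before eventually decaying.
A = np.array([[0.9, 10.0],
              [0.0,  0.9]])

rho = max(abs(np.linalg.eigvals(A)))   # spectral radius = 0.9
norms = [np.linalg.norm(np.linalg.matrix_power(A, k), 2)
         for k in range(1, 201)]

print(norms[0])    # ||A||_2 is about 10: transient growth
print(norms[-1])   # ||A^200||_2 is tiny: eventual linear decay
# Gelfand: ||A^k||^(1/k) -> rho(A) = 0.9, from above and slowly here
print(norms[-1] ** (1 / 200))
```

Note how slowly the Gelfand ratio approaches $0.9$: at $k=200$ it is still above $0.9$, which is exactly why $\rho < 1$ alone does not give an immediate per-step contraction.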
High dimensional mean value theorem.
Here's a new "overthinking it" issue for you. For a while now, I've been relying on the high-dimensional version of a Taylor series approximation result to help me flip variables and gradients. That is, for 1-D functions with continuous Hessians everywhere, the mean value theorem says that there always exists a $z$ where $x\leq z … Continue reading High dimensional mean value theorem.
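The 1-D statement the excerpt quotes can be checked on a toy function; $f(x) = x^3$ on $[0,1]$ is my own choice here, not the post's. The mean value theorem guarantees a $z\in(0,1)$ with $f'(z) = (f(1)-f(0))/(1-0)$, and for this $f$ we can solve for $z$ in closed form.

```python
# 1-D mean value theorem sanity check on f(x) = x^3 over [0, 1]:
# there exists z in (0, 1) with f(1) - f(0) = f'(z) * (1 - 0).
f = lambda x: x**3
fp = lambda x: 3 * x**2          # f'(x)

a, b = 0.0, 1.0
slope = (f(b) - f(a)) / (b - a)  # secant slope = 1.0
z = (slope / 3) ** 0.5           # solve f'(z) = 3 z^2 = slope directly
print(a < z < b)                 # True
print(abs(fp(z) - slope))        # 0 up to floating point
```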
We need to talk about Adam
At some point in every optimist's life, we are going to be asked the age-old question: "How does your method compare with Adam?" or, even more annoyingly, "Does this analysis help us understand Adam?" I mean, Adam is a nice method that works very well for a very particular class of optimization problems, and not, like, … Continue reading We need to talk about Adam
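For reference, the Adam update itself (Kingma & Ba, with the usual default hyperparameters) is short enough to sketch; the toy quadratic below is my own choice, not anything from the post.

```python
import numpy as np

def adam(grad, x0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=500):
    """Plain Adam with bias-corrected moment estimates."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)   # first-moment (mean) estimate
    v = np.zeros_like(x)   # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**t)   # bias correction
        v_hat = v / (1 - b2**t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Toy run on f(x) = 0.5 * ||x||^2, whose gradient is simply x.
x_star = adam(lambda x: x, [5.0, -3.0])
print(np.linalg.norm(x_star))   # much smaller than the starting norm
```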
Projected gradient descent, revisited
It turns out that some of my previous statements on PGD may not have been, well, correct. At first I thought it was just a few typos, but as I studied my notes further, I realized there may be some fundamental flaws. So this blog post revisits some ideas and hopefully irons out some past … Continue reading Projected gradient descent, revisited
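The basic PGD iteration the post revisits, sketched on a toy problem of my own choosing: minimize a quadratic over the box $[0,1]^2$, where the Euclidean projection is just a coordinate-wise clip.

```python
import numpy as np

# Projected gradient descent on f(x) = 0.5 * ||x - c||^2 over the box
# [0, 1]^2: a gradient step followed by a Euclidean projection.
def project_box(x, lo=0.0, hi=1.0):
    return np.clip(x, lo, hi)

c = np.array([2.0, -0.5])   # unconstrained minimizer, outside the box
x = np.array([0.5, 0.5])
for _ in range(100):
    # grad f(x) = x - c, L = 1, step size 0.5 < 2/L
    x = project_box(x - 0.5 * (x - c))

print(x)   # converges to [1., 0.], the projection of c onto the box
```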
Video time! Discretization methods on quadratic and logistic regression
Hi all, this is another "tiny project" that I've been wanting to do for a while. Lately, I've been somewhat obsessed with method flows and discretization methods. While we can sit here and write integrals until our faces turn blue, I think it's time to just simulate some stuff and see what happens. In … Continue reading Video time! Discretization methods on quadratic and logistic regression
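The simplest simulation of this flavor (my own sketch, assuming "discretization" here means something like forward Euler on the gradient flow): Euler with step $h$ on $\dot x = -\nabla f(x)$ is exactly gradient descent, and for a 1-D quadratic we can compare against the exact flow $x(t) = x_0 e^{-at}$.

```python
import numpy as np

# Forward-Euler discretization of the gradient flow x'(t) = -grad f(x)
# for f(x) = 0.5 * a * x^2: each Euler step is a gradient descent step,
# and the exact flow is x(t) = x0 * exp(-a t).
a, x0, h, T = 2.0, 1.0, 0.01, 1.0
n = int(T / h)

x = x0
for _ in range(n):
    x = x - h * a * x   # Euler step == gradient descent with step h

exact = x0 * np.exp(-a * T)
print(x, exact)         # agree to O(h) for small h
```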
Some proofs of continuous time things
In trying to understand how continuous time interpretations can help optimization research, I find myself doing a lot of "reinventing the wheel", but in a nontrivial way; I keep deriving proofs that I'm sure are standard knowledge, but it's not like we learned them back when we were kids. So I'm going to try … Continue reading Some proofs of continuous time things
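A typical example of such "standard knowledge": along the gradient flow $\dot x = -\nabla f(x)$ we have $\frac{d}{dt} f(x(t)) = -\|\nabla f(x(t))\|^2 \le 0$, so the objective is nonincreasing. A small-step Euler simulation (my own sketch) shows the monotone decrease:

```python
import numpy as np

# f(x) = 0.5 * x^T Q x, grad f(x) = Q x; simulate the gradient flow
# with a small Euler step and record the objective along the way.
Q = np.diag([1.0, 3.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

x = np.array([1.0, -2.0])
h = 1e-3
vals = [f(x)]
for _ in range(2000):
    x = x - h * grad(x)
    vals.append(f(x))

# d/dt f(x(t)) = -||grad f||^2 <= 0: f is nonincreasing along the flow
print(all(v1 >= v2 for v1, v2 in zip(vals, vals[1:])))   # True
```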
A fun convex set fact
This is actually not a new thing at all, but a thing I played with a couple of years ago and ended up using in a numerical optimization lecture to try to impress some undergrads. They were not impressed. Since then I've tuned it a bit and just kind of take it for granted now, … Continue reading A fun convex set fact
Small proof: $L$-smoothness definitions and equivalences
tl;dr In a former version of this post, I conjectured that while $L$-smoothness and convexity were not equivalent in definition, they were tied together in some of the "equivalent" (not equivalent) versions of their statements. To clarify, my former statement: "Convexity and smoothness are totally independent concepts and should not be confused with each other" … Continue reading Small proof: $L$-smoothness definitions and equivalences
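One of the "equivalent" statements in question, checked numerically on a quadratic (my own sketch, not the post's): for $f(x) = \tfrac12 x^\top Q x$, $f$ is $L$-smooth with $L = \lambda_{\max}(Q)$, and the upper bound $f(y) \le f(x) + \langle\nabla f(x), y-x\rangle + \tfrac{L}{2}\|y-x\|^2$ holds at every pair of points.

```python
import numpy as np

# For f(x) = 0.5 x^T Q x with Q symmetric PSD, f is L-smooth with
# L = lambda_max(Q); spot-check the quadratic upper bound at random pairs.
rng = np.random.default_rng(0)
Q = np.array([[2.0, 1.0], [1.0, 3.0]])
L = max(np.linalg.eigvalsh(Q))

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

ok = True
for _ in range(1000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    ub = f(x) + grad(x) @ (y - x) + 0.5 * L * (y - x) @ (y - x)
    ok &= f(y) <= ub + 1e-12
print(ok)   # True
```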
Small proof: The gradient norms in gradient descent descend.
I've been racking my brain lately over whether there is an appropriate place to put "small proofs", e.g. results that aren't trivial, that I can't find anywhere, but that aren't cool enough to inspire a whole new paper. For now, I think I'll put them here. Background We investigate $\underset{x}{\min}\, f(x)$ using gradient descent $$ x^{(k+1)} = … Continue reading Small proof: The gradient norms in gradient descent descend.
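A numerical version of the claim in the title, on a convex quadratic with step size $1/L$ (my own sketch; the post presumably proves the general convex $L$-smooth case):

```python
import numpy as np

# Gradient descent on f(x) = 0.5 x^T Q x with step size 1/L,
# L = lambda_max(Q): the gradient norms are monotonically nonincreasing,
# since grad f(x_{k+1}) = (I - Q/L) grad f(x_k) and ||I - Q/L|| <= 1.
Q = np.array([[1.0, 0.0], [0.0, 4.0]])
L = max(np.linalg.eigvalsh(Q))
grad = lambda x: Q @ x

x = np.array([3.0, -1.0])
gnorms = []
for _ in range(50):
    gnorms.append(np.linalg.norm(grad(x)))
    x = x - (1.0 / L) * grad(x)

print(all(a >= b for a, b in zip(gnorms, gnorms[1:])))   # True
```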
Newton’s method II: Self-concordant functions
tl;dr: Self-concordance: definition, some examples, some interesting properties. Proof of Newton's rates left as an exercise for the reader 🙂 An interesting generalization of strongly convex functions is the class of self-concordant functions, which can be defined by $$ |D^3 f(x)[u,u,u]|\leq 2M\|u\|_{\nabla^2 f(x)}^3 $$ Here, the notation $$D^3 f(x) [u,v,w] = \sum_{i,j,k}\frac{\partial^3 f(x)}{\partial x_i \partial x_j \partial x_k} \, … Continue reading Newton’s method II: Self-concordant functions
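The textbook example: $f(x) = -\log x$ is self-concordant with $M=1$, since $|f'''(x)| = 2/x^3 = 2\,(f''(x))^{3/2}$, with equality everywhere. A quick numerical confirmation (my own sketch):

```python
import numpy as np

# f(x) = -log(x): f''(x) = 1/x^2, f'''(x) = -2/x^3, so the 1-D
# self-concordance ratio |f'''| / (f'')^(3/2) is identically 2,
# i.e. self-concordant with constant M = 1.
f2 = lambda x: 1.0 / x**2    # f''(x)
f3 = lambda x: -2.0 / x**3   # f'''(x)

xs = np.linspace(0.1, 10.0, 100)
ratios = np.abs(f3(xs)) / f2(xs) ** 1.5
print(ratios.min(), ratios.max())   # both 2.0 up to floating point
```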