Video time! Discretization methods on quadratic and logistic regression

Hi all, this is another “tiny project” that I’ve been wanting to do for a while. Lately, I’ve been somewhat obsessed with the continuous-time flows behind optimization methods and the discretization schemes that turn them back into algorithms. While we can sit here and write integrals until our faces turn blue, I think it’s time to just simulate some stuff and see what happens. In this post I explore four problems:

Problems

  • Quadratic optimization, somewhat ill-conditioned (duh).
  • Normalized quadratic optimization, to look at nonsmoothness and bad curvature.
  • Logistic regression, well separated.
  • Logistic regression, poorly separated.

The methods include

  • GD = gradient descent, a.k.a. Euler’s discretization (explicit)
  • MDPT = Midpoint method (explicit)
  • RK4 = 4th order Runge–Kutta method (explicit)
  • BE = Backward Euler (implicit)
  • BDF2 = 2nd order Backward Differentiation Formula (implicit)
  • TZ = Trapezoidal rule (implicit)

Here, an approximate implicit method solves the implicit update with 1 inner gradient step, and a full implicit method uses 100 inner gradient steps; both are sketched below.
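To make the comparison concrete, here is a minimal sketch of the steppers applied to the gradient flow $\dot{x} = -\nabla f(x)$. This is my own reconstruction rather than the code behind the videos; names like `grad_f`, `inner_steps`, and `inner_lr` are assumptions, and BDF2/TZ are omitted since they reuse the same inner loop with a different implicit equation.

```python
import numpy as np

def forward_euler(x, grad_f, h):
    # GD: one explicit Euler step on the gradient flow dx/dt = -grad_f(x)
    return x - h * grad_f(x)

def midpoint(x, grad_f, h):
    # MDPT: evaluate the gradient at an explicit half step, then step from x
    x_half = x - 0.5 * h * grad_f(x)
    return x - h * grad_f(x_half)

def rk4(x, grad_f, h):
    # RK4: classical 4th-order Runge-Kutta applied to dx/dt = -grad_f(x)
    k1 = -grad_f(x)
    k2 = -grad_f(x + 0.5 * h * k1)
    k3 = -grad_f(x + 0.5 * h * k2)
    k4 = -grad_f(x + h * k3)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def backward_euler(x, grad_f, h, inner_steps=100, inner_lr=0.1):
    # BE: solve z = x - h * grad_f(z), i.e. minimize the proximal objective
    # 0.5 * ||z - x||^2 + h * f(z), by inner gradient descent.
    # inner_steps=1 gives the approximate variant, inner_steps=100 the full one.
    z = np.array(x, dtype=float)
    for _ in range(inner_steps):
        z = z - inner_lr * (z - x + h * grad_f(z))
    return z
```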

 

 

Quadratic

$$f(x) = \frac{1}{2}x^TQx$$
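For reference, here is one way to set up such an ill-conditioned quadratic; the dimension, spectrum, and seed below are my own choices, not necessarily the ones used in the videos.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
eigs = np.array([1.0, 100.0])                     # condition number ~ 100
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal basis
Q = U @ np.diag(eigs) @ U.T                       # SPD matrix with that spectrum

f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x                          # the flow is dx/dt = -Q x
```

On this problem forward Euler (GD) is only stable for $h < 2/\lambda_{\max}(Q)$, which is exactly the regime where the implicit methods get to show off larger step sizes.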

 

 

Normalized quadratic

$$f(x) = \sqrt{x^TQx}$$
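The gradient here is $\nabla f(x) = Qx/\sqrt{x^TQx}$: it is homogeneous of degree zero (it does not shrink as you approach the optimum) and undefined at the minimizer $x=0$, which is exactly the nonsmoothness being probed. A small sketch, with an `eps` guard that is my own addition:

```python
import numpy as np

def normalized_quadratic(Q):
    # f(x) = sqrt(x^T Q x); its gradient Q x / sqrt(x^T Q x) keeps the same
    # scale no matter how close x is to the optimum, and is undefined at 0.
    def f(x):
        return np.sqrt(x @ Q @ x)

    def grad_f(x, eps=1e-12):
        return Q @ x / (np.sqrt(x @ Q @ x) + eps)  # eps avoids division by zero

    return f, grad_f
```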

 

Logistic regression

In the logistic regression models,

$$f(\theta) = \frac{1}{m}\sum_{i=1}^{m}\log\left(1+\exp(-y_ix_i^T\theta)\right)$$

where we generate data using the 2-blob model

$$x_i = y_i c + z_i $$

and $\|c\|_2$ is tuned to control the separation. The plots include not just the trajectory over the loss landscape, but also the classification boundary, so you can see whether the suboptimality actually translates into bad classification or not.
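Here is a minimal sketch of the data model and objective, assuming Gaussian noise $z_i$ and a center $c$ along the all-ones direction; the function names, `sep`, and the seed are mine.

```python
import numpy as np

def make_blobs(m=200, d=2, sep=3.0, seed=0):
    # Two-blob model x_i = y_i * c + z_i with labels y_i in {-1, +1};
    # ||c||_2 = sep controls how well separated the blobs are.
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=m)
    c = sep * np.ones(d) / np.sqrt(d)
    X = y[:, None] * c + rng.standard_normal((m, d))
    return X, y

def logistic_loss_and_grad(theta, X, y):
    # f(theta) = (1/m) * sum_i log(1 + exp(-y_i x_i^T theta))
    m = len(y)
    margins = y * (X @ theta)
    loss = np.mean(np.logaddexp(0.0, -margins))
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / m
    return loss, grad
```

With no intercept term, the classification boundary in the plots is just the hyperplane $x^T\theta = 0$.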

 

Well separated logistic regression

 

 

 

 

Poorly separated logistic regression

 

 

 

 

Discussion

I had a few questions going into this tiny experiment, some of which were answered, and some of which remain open.

  • Do implicit methods work? Answer: hard to say. They definitely do hug the flow better, and on the quadratic and normalized quadratic they seem to help a lot with conditioning and with being able to pick large step sizes. For logistic regression, however, they don’t contribute much to margin orientation, which seems to be better served by the noisier explicit methods.
  • Are their approximate versions sufficient? Note that an approximate implicit method has the same per-iteration complexity as an explicit method. Answer: yes, the approximate methods are almost as good as the full ones with regard to conditioning robustness and flow hugging!
  • Does it make sense to use quadratic problems as stand-ins for machine learning tasks? Answer: this I kind of knew, but the answer is not really. Well, actually, it depends. For poorly separated blobs, a quadratic approximation is not bad and seems to characterize what’s going on pretty well; but there the margin orientation is somewhat arbitrary anyway and overfits to idiosyncrasies of the training data. When the blobs are well separated, on the other hand, the step size choices, the flow characterization, all of that doesn’t seem to align well with margin orientation!

More studies are needed to capture embedding learning as well, but anyway, it’s called a tiny project for a reason!

Finally, I leave you with an image of our two favorite cheerleaders.

 
