Hi all, it’s been a while! I took a step back from this blog to focus on teaching and research and such, but now that the last class is over for the fall, time to get back to this!

Today’s topic is now focused more on learning theory, which is an area I’m trying to learn more about. I’m going to cover Lemma 1 (and related topics) from [1], which was first introduced in [2]:

[1] A tutorial on Online Supervised Learning with Applications to Node Classification in Social Networks, by Alexander Rakhlin and Karthik Sridharan.

[2] Behavior of sequential predictors of binary sequences, by Thomas Cover.

Many thanks to Baojian for the topic introduction and Reza and Nikhil for discussions.

# The scenario

The goal here is sequence prediction. Let’s make life easy for us so far, and say that we are dealing with binary sequences, e.g. $y_t \in \{1,-1\}$ for $t = 1,… $. The goal is to estimate $\hat y_t\in \{1,-1\}$ based on observations $y_1,…,y_{t-1}$, using a distribution $p_t = \text{pr}(\hat y_t = 1)$, and we will represent this with a surrogate variable $q_t\in[-1,1]$ where $q_t = \mathbb E[\hat y_t]$ under the distribution $p_t$.

## Some examples

- Suppose that the sequence is extremely repetitive. That is, either $y_t = (1,1,1,1,…)$ or $y_t = (-1,-1,-1,-1,…)$. Then by $t = 2$, we can make perfect predictions by just picking $q_t = y_1$.
- Suppose that the sequence has repeating blocks of length $K$. That is, if $K = 3$, then a possible $y_t = (1,1,1,-1,-1,-1,-1,-1,-1,1,1,1,-1,-1,-1,…)$. Then we could at best do something with 60% accuracy, if we adopt the strategy of $\hat y_t = y_{t-1}.$ However, note that we could also construct an adversarial $y’_t = (1,-1,1,-1,1,-1,…)$ where this strategy would fail 100% of the time.

These examples show that in some real world prediction scenario, where there is some semblance of structure in the sequences, we might be able to come up with some reasonable algorithm. However, the existence of a not-terrible-algorithm depends on a *structural prior*, e.g. the set of allowable sequences must exist in some set smaller than $2^T$ where $T$ is the sequence length.

This concept also made more sense to me after I learned that these ideas were first explored by Thomas Cover in [2],** ** who primarily deals with codes over noisy channels. (The second example given is exactly a repetition code.) It makes sense to conclude that in the absence of any codes, no message can survive noise; on the other hand, enforcing some kind of coding technique limits the types of messages ultimately sent , and also limits the types of messages of which we need to “protect” from noise.

## Multiclass extension

In order to borrow most of the tools from the proofs in [1], the rest of the post will discuss this scenario in the multiclass framework, e.g. instead of $y_t\in \{-1,1\}$, rather we consider $y_t\in \{1,2,…,K\}$ for some $K\geq 2$. (Note that $K = 2$ reduces to the binary case.) In other words, feasible strings with some kind of structural prior to keep in your head may look like:

- (1,1,1,2,2,2,1,1,1,3,3,3,1,1,1,4,4,4,3,3,3,2,2,2)
- (cat, cat, cat, cat, cat, cat, cat, cat, cat) (as opposed to dog)
- (春,春,夏,秋,秋,秋,冬,春,夏,夏,夏,秋,秋,冬,冬,冬,冬)

You get the picture.

# The no-free-lunch theorem

The first theorem we explore is an “online version” of the no free lunch theorem. In particular, it says that for any algorithm $q_t$, if we do not impose a prior, then for binary sequences of length $T$, it must be that

$$\mathbb E_y \mathbb E_{\hat y\sim p}\left[\frac{1}{T}\sum_{t=1}^T \mathbb 1_{\hat y_t\neq y_t}\right] = \frac{1}{2}$$

and more generally, for $K$-class sequences, where both $y_t$ and $\hat y_t$ can pick values in $\{1,2,…,K\}$,

$$\mathbb E_y \mathbb E_{\hat y\sim p}\left[\frac{1}{T}\sum_{t=1}^T \mathbb 1_{\hat y_t\neq y_t}\right] = 1-\frac{1}{K}.$$

The proof for this is quite straightforward. The key observation is to see that there is no such thing as a “strong strategy $p$”, and in fact, for any $p$,

$$\mathbb E_{y_T} \mathbb E_{\hat y_T\sim p}\left[ \mathbb 1_{\hat y_t\neq y_t}| y_1,…y_{T-1}\right] = 1-\frac{1}{K}.$$

The second example above shows this tragedy; no matter what you predict, there exists an equal number of scenarios where that prediction was wrong. Applying this across all time $t = 1,…,T$ generalizes the result.

# Cover’s achievability lemma

But wait! Is this end of the story? Shall we despair and just call it quits? Nah, actually this is where Cover’s achievability theorem arises, and it’s quite cool. It basically says that while we cannot improve the *universal* accuracy of any algorithm, we *can* impose the *way* that it fails, by deciding which sequences we prioritize and which we throw in the gutter. (Think repetition codes = good, arbitrary sequences = bad.)

We call this goal function $\phi:\{-1,1\}^T\to [0,1] $, e.g. $\phi(y) = $ the probability that the sequence $y\in \{-1,1\}^T$ is guessed wrong. We say that $\phi$ is *stable* if for any two binary sequences $y_1\in \{-1,1\}^T$ and $y_2 \in \{-1,1\}^T$, if the Hamming distance between $y_1$ and $y_2$ is at most 1, then $|\phi(y_1)-\phi(y_2)|\leq \frac{1}{T}$. Written more explicitly,

$$ |\phi(…,1,…)-\phi(…,-1,…)|\leq \frac{1}{KT}. $$

Then Cover’s theorem says that this error distribution $\phi$ is achievable if and only if $\mathbb E_y[\phi(y)] \geq 1/2$ for binary sequences, and $\mathbb E_y[\phi(y)] \geq 1-1/K$ for multiclass sequences. Intuitively, this kind of makes sense, but there still is a somewhat finality in the statement, which says that while we can move things around, we really can never improve on our worst case scenario.

The rest of this blog post covers the proof of this achievability lemma, which is constructive in actually showing how one can define $q_t$ and $p_t$ given $\phi$.

# Proof of Cover’s lemma

## Water filling

Let us first consider the following multiclass min-max problem

$$\min_{q\in \Delta_K } \max_{y\in \{1,2,…,K\}} E_{\hat y\sim q} [ 1_{\hat y\neq y}] + R(y) \qquad (\star).$$

Now define a vector $r$ where $r_k = R(k)$, and assume that $r_i-r_j \leq 1/K$ for all $i,j$, e.g. stability condition. Define also $e_y$ the one-hot vector whose $y$th term is 1. Then $(\star)$ becomes

$$\min_{q\in \Delta_K } \max_{y\in \{1,2,…,K\}} (\mathbf{1}-q+r)^Te_y \qquad (\star\star)$$

and therefore the optimal $y = \arg\max_k\, 1-q_k+r_k$. Therefore if $q$ wants to minimize things, it will try to force the maximum value of $\mathbf{1}-q+r$ to be as small as possible, e.g.

$$(\star) \iff (\star\star)\quad \iff\quad \min_{q\in \Delta_K } \max_k \; 1-q_k+r_k.$$

This is accomplished by “water filling”:

- Start with $q_k = 0$ for all $k$.
- At each iteration, find all the indices $k = \arg\max_k \; 1-q_k+r_k$, and then fill up $q_k$ until a (or several) new indices $i$ are in the max.
- Keep doing this until $\sum_k q_k = 1$, e.g. you run out of “water supply”.

Mathematically, this means that $q_k =\max\{0, r_k-c\}$, where $c$ is chosen such that $\sum_k q_k =1$. See below an example where $K = 9$.

Note also that the stability condition enforces that $\sum_{i=1}^K r_{\max}-r_i \leq 1$, so in fact the flag $q_k = 0$ will never happen, and we can just write $q_k = r_k-c$. Explicitly, this means

$$ c = \frac{ \sum_{i=1}^K r_i }{K}- \frac{1}{K}$$

and

$$(\star)\quad =\quad 1+c = \frac{\sum_{i=1}^K r_i}{K}+\left(1-\frac{1}{K}\right) = \mathbb E_{y\in \{1,…,K\}} [R(y)] + \left(1-\frac{1}{K}\right).$$

## Work backwards.

Let us now introduce what everything is.

Let’s first write out our regret over a predetermined sequence $y$, as

$$\text{Total regret}(y) = E_{q_1}[\mathbb 1_{y_1\neq \hat y_1}] + E_{q_2|y_1}[\mathbb 1_{y_2\neq \hat y_2}] + E_{q_3|y_1,y_2}[\mathbb 1_{y_3\neq \hat y_3}] + \cdots + E_{q_T|y_1,…,y_{T-1}}[\mathbb 1_{y_T\neq \hat y_T}]$$

Again, we cannot reasonably bound the total regret over *all* choices of $y$, but we *can* (and want to) reasonably show that this total regret faction can be upper bounded by $T\cdot \phi(y)$, provided $\phi(y)$ has the correct stability condition, and has average $\mathbb E_y[\phi(y)] = 1-1/K$.

Suppose now that at iteration $t-1$, we had incurred $M_{t-1}(y_1,…,y_{t-1})$ mistakes. Then the expected number of total mistakes at step $t$ will be

$$M_t(y_1,…,y_t) = \mathbb E_{q_t}[\mathbb 1_{y_t\neq \hat y_t}] + M_{t-1}(y_1,…,y_{t-1}).$$

We impose that $M_T(y_1,…,y_T) = T\phi(y_1,…,y_T)$. Then, we work backwards, and see that by assigning $R_t(y_t) = -M_t(y_1,…,y_t)$, and using the previous waterfilling method (optimal $q$ for worst case $y$),

$$ \min_q\max_y \mathbb E_{q_t}[\mathbb 1_{y_t\neq \hat y_t}] – M_t(y_1,…,y_t)=\left(1-\frac{1}{K}\right) – \frac{\sum_{y_t} M_t(y_1,…,y_t)}{K}.$$

Therefore, using this strategy $q$,

$$-M_{t-1}(y_1,…,y_{t-1}) = \mathbb E_{q_t}[\mathbb 1_{y_t\neq \hat y_t}] – M_{t}(y_1,…,y_{t})

\leq \left(1-\frac{1}{K}\right) – \mathbb E_{y_t}[M_t(y_1,…,y_t)] .

$$

By telescoping, we get to

$$-M_0 \leq T\left(1-\frac{1}{K}\right) – \mathbb E_y[\phi(y_1,…,y_T)]$$

which, under the assumption that $\mathbb E_y[\phi(y_1,…,y_T)] \geq \left(1-\frac{1}{K}\right) $ implies that $M_0 \geq 0$, e.g. this solution is feasible. Since it was optimized at each step, it is also optimal.

# What’s the point?

First off, there is some beauty of course in knowing that while the no free lunch theorem says you can’t ever do better than random on average, you *can *decide how that randomness looks. In machine learning, this means you can force label patterns that don’t appear often to be always predicted wrong, but label patterns that appear all the time to be almost always correct. What’s more, the proof is constructive, and the sequence $q_t$ exactly gives the algorithm!

The other, more practical end of it is just that it makes proof writing for this type of problem easier in general. From now on, rather than providing algorithms and proofs, all I have to say is present a $\phi$ a function over all label sequences, and show that it satisfies

- stability: $\phi(…,i,..)-\phi(…,j,…)<\frac{1}{TK}$
- average error $\mathbb E_y[\phi(y)] \geq (1-1/K)$

and I know that an algorithm exists to achive this $\phi$! Then, taking expectations over a distribution more amenable to my data, say, $\mathcal D$, gives the new performance result!

- average error over all possible labels: $\mathbb E_{y\sim \mathrm{Unif}}[\phi(y)] = (1-1/K)$,
- regret I actually care about $\mathbb E_{y\sim \mathcal D}[\phi(y)]) = $ possibly much lower!

In other words, while there is in general no free lunch, if I know where to look I may still find some cheaper lunches lying around! (at the cost of expensive lunches elsewhere…)