I’m almost sure this converges (to a random variable)

It’s always nice to have friends who encourage you to revisit old concepts that you had seen before as a student, and at that time had just concluded “yeah, there’s no way I can ever understand that,” but now are ~~forced~~ encouraged to confront a second time. So today’s post is going to be a bit of a walk down memory lane, through a pernicious concept that I really don’t think I’m alone in finding subtle as a first-year graduate student: the different types of convergence to a random variable.

Specifically, consider a sequence of random variables $X_1, X_2, \dots$. Then we are discussing the convergence $X_n\to X$, where $X$ is another random variable. This is the first nontrivial fact that I had to remember: these convergence notions are about a sequence of random variables converging to another random variable, not about a sequence of numbers converging to a number. Hence, it is not 100% clear what the word “converge” means, without more context.

Ok so here it is: we have four types of convergence:

  1. almost sure convergence: $\mathrm{Pr}(\lim_{n\to\infty} X_n = X) = 1$
  2. convergence in mean square: $\mathbb E[(X_n-X)^2]\to 0$
  3. convergence in probability: for every $\epsilon > 0$, $\mathrm{Pr}(|X_n-X|>\epsilon) \to 0$
  4. convergence in distribution: $\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$ at every point $x$ where $F_X$ is continuous, where $F_Z$ is the c.d.f. of the random variable $Z$.

The chain of implications is also well known, in particular:

$\{$conv. almost surely, conv. mean square$\}$ $\Rightarrow$ conv. in probability $\Rightarrow$ conv. in distribution

That is, either 1 or 2 implies 3, which in turn implies 4.

What I would like to do now is a deep dive into each of these types of convergence, to try to understand the implications, and why indeed they only go one way.

Convergence in distribution vs everybody else

Probably the first thing we should get out of the way is that convergence in distribution is fundamentally different from convergence of the random variables themselves. Convergence in distribution simply says that the c.d.f. of the random variables $X_n$ converges to a limiting c.d.f. $F_X$ (at every point where $F_X$ is continuous). But it does not require the values drawn from $X_n$ to converge to the values drawn from $X$.

A simple example of this is the distribution of $X_n$ converging to a uniform distribution over $[-1,1]$. Then, no matter how large $n$ is, we cannot predict what exact value will be drawn from $X_n$ with any meaningful probability. So there is no reason for a value drawn from $X_n$ to converge to the value drawn from $X$ (it is not even settling toward a single value!).
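To make this concrete, here is a minimal numpy sketch; the setup, $X_n := -X$ with $X$ uniform on $[-1,1]$, is my own choice of example, not one from the post above. Every $X_n$ has exactly the same distribution as $X$, so convergence in distribution holds trivially, yet the realized values never approach each other:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100_000)   # draws of X ~ Uniform[-1, 1]
x_n = -x                               # X_n := -X has the same distribution as X

# The distributions agree (so X_n -> X in distribution, trivially)...
print(np.quantile(x,   [0.25, 0.5, 0.75]))   # ~ [-0.5, 0.0, 0.5]
print(np.quantile(x_n, [0.25, 0.5, 0.75]))   # ~ [-0.5, 0.0, 0.5]

# ...but the realized values stay far apart: |X_n - X| = 2|X|.
print(np.mean(np.abs(x_n - x) > 0.5))  # ~ 0.75, and independent of n
```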

So, in all the cases except convergence in distribution, we actually are still talking about $X_n$ converging to a fixed value. We may not know what that value is, and there may be much uncertainty associated with it, but it exists and is modeled by $X$. An example of this is the law of large numbers, where $X_n$ is the sample mean of an i.i.d. sample of size $n$. Then $X$ is the true mean, and $X_n\to X$ in probability (by the weak law), even though we do not necessarily know what the true mean is.
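A hedged sketch of the weak law in action, with Uniform$[0,1]$ draws (so the true mean is $1/2$; the sample sizes and trial counts are arbitrary choices for illustration): the probability that the sample mean strays more than $\epsilon$ from the mean visibly shrinks with $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, mu = 0.05, 0.5  # true mean of Uniform[0, 1]

# Estimate Pr(|sample mean - mu| > eps) over 1000 repeated experiments.
for n in [10, 100, 1000, 10_000]:
    means = rng.uniform(0, 1, size=(1000, n)).mean(axis=1)  # 1000 draws of X_n
    print(n, np.mean(np.abs(means - mu) > eps))             # -> 0 as n grows
```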

Almost sure convergence vs convergence in probability

Here, I find it’s useful to really break down both convergences in terms of their definitions. Specifically,

$$
\text{Pr}\left(\forall \epsilon > 0, \; \exists \bar n, \; \forall n \geq \bar n: \; |X_n-X|\leq \epsilon\right) = 1 \tag{almost sure, a.s.}
$$

$$
\forall \epsilon > 0, \; \forall \epsilon_2 > 0, \quad \exists \bar n, \quad \forall n \geq \bar n, \quad \text{Pr}(|X_n-X|\leq \epsilon) \geq  1-\epsilon_2 \tag{in probability, i.p.}
$$

Actually, by re-writing the definitions explicitly (the a.s. statement just unpacks $\lim_{n\to\infty} X_n = X$ inside the probability), the difference becomes visible: in a.s. convergence the probability sits *outside* the quantifiers — there is a single probability-one event on which the convergence holds for every $\epsilon$ simultaneously — while in i.p. convergence the probability is evaluated separately at each $n$, and is only required to approach 1. This is why a.s. implies i.p., but i.p. does not imply a.s.
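To see the forward implication formally (a one-line argument via continuity of probability on increasing events), fix $\epsilon > 0$ and let $A_{\bar n} = \bigcap_{m \geq \bar n} \{|X_m - X| \leq \epsilon\}$. These events increase with $\bar n$, and a.s. convergence says exactly that their union has probability 1. Hence

$$
\text{Pr}(|X_n-X|\leq \epsilon) \;\geq\; \text{Pr}(A_{\bar n}) \quad \text{for all } n \geq \bar n, \qquad \text{Pr}(A_{\bar n}) \;\overset{\bar n\to\infty}{\longrightarrow}\; \text{Pr}\Big(\textstyle\bigcup_{\bar n} A_{\bar n}\Big) = 1,
$$

so given $\epsilon_2$ we can pick $\bar n$ with $\text{Pr}(A_{\bar n}) \geq 1-\epsilon_2$, which is the i.p. statement. Nothing in this argument runs in reverse: i.p. controls each $n$ separately, and says nothing about a single event on which all large $n$ behave.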

To understand their differences, let’s consider the following scenario. Suppose that $X_n$ is the random variable giving the average test score of a class of students, and $n$ is the number of times they’ve already taken this test (so $n = 1,2,\dots$). You would hope that for $n>\bar n$, for some large enough $\bar n$, these scores get really close to 100%. And indeed, they converge to 100% a.s.


(Converges in all the senses)

But now suppose that in the class there are a small number of androids pretending to be human, who are also taking the test. Unlike humans, these androids are not that easy to teach, so they consistently score 50% — no matter how many times they take the test, they maintain this score. Over time, you start to identify which students are human and which are androids, and you systematically kick the androids out of your class, but their number never actually reaches 0. So now your average test score is a mix of two random variables: $X_n\overset{a.s.}{\to} X = 100\%$, and $Y = 50\%$. Specifically, say that at trial $n$ an android sneaks in with probability $\beta_n = 1/n$, independently across trials, in which case $Z_n = Y$; otherwise $Z_n = X_n$. Then at any single large $n$, $Z_n$ almost always does what $X_n$ does: $\text{Pr}(|Z_n - X| > \epsilon) \leq \beta_n + \text{Pr}(|X_n - X| > \epsilon) \to 0$, so $Z_n$ converges to $X$ in probability. But because $\sum_n \beta_n = \sum_n 1/n = \infty$ and the android events are independent, the second Borel–Cantelli lemma says androids appear infinitely often with probability 1 — so on almost every sample path $Z_n$ keeps dipping back to 50%, and the convergence is not almost sure. (A simulation is sketched below the caption.)

(Converges in probability, but not almost surely)

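Here is a small numpy sketch of the android story (the constants — 1000 paths, trials 50 through 5000 — are arbitrary choices of mine): the per-trial deviation probability $\beta_n = 1/n$ vanishes, yet nearly every simulated path still contains an android appearance beyond $\bar n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_bar, N = 1000, 50, 5000

# At trial n, an android appears with probability beta_n = 1/n,
# independently across trials, forcing Z_n = 50% at that trial.
ns = np.arange(n_bar, N + 1)
android = rng.random((n_paths, ns.size)) < 1.0 / ns

# In probability: a deviation at any single large n is rare.
print(np.mean(android[:, -100:]))    # ~ 1/5000

# Not almost surely: nearly every path still deviates somewhere past n_bar,
# because sum(1/n) diverges (second Borel-Cantelli lemma).
print(np.mean(android.any(axis=1)))  # ~ 0.99
```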

Convergence in mean square vs convergence in probability

Again, it helps to first write down the definitions.

$$
\forall \epsilon > 0,  \quad \exists \bar n, \quad \forall n \geq \bar n, \quad \mathbb E[(X_n-X)^2] \leq \epsilon \tag{mean square, m.s.}
$$

$$
\forall \epsilon_1 > 0,  \; \forall \epsilon_2 > 0, \quad \exists \bar n, \quad \forall n \geq \bar n, \quad \text{Pr}(|X_n-X|\leq \epsilon_1) \geq  1-\epsilon_2 \tag{in probability, i.p.}
$$

That m.s. implies i.p. is a consequence of Markov’s inequality, which states that $\text{Pr}(X\geq a)\leq \frac{\mathbb E[X]}{a}$ for a nonnegative random variable $X$ and $a > 0$. Then, for any $\epsilon_1,\epsilon > 0$,

$$
\text{Pr}(|X_n-X|\geq \epsilon_1)  =\text{Pr}((X_n-X)^2\geq \epsilon_1^2)  \leq \frac{\mathbb E[(X_n-X)^2]}{\epsilon_1^2} \leq \frac{\epsilon}{\epsilon_1^2}.
$$
Given $\epsilon_1$ and $\epsilon_2$, picking $\epsilon = \epsilon_2\epsilon_1^2$ (so that $\epsilon/\epsilon_1^2 = \epsilon_2$) completes the proof.
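As a quick numerical sanity check of this inequality (the Gaussian noise model for $X_n - X$, with variance $1/n$, is just a test case of my own choosing), the bound indeed dominates the deviation probability at every $n$:

```python
import numpy as np

rng = np.random.default_rng(0)
eps1 = 0.1

# Take D := X_n - X to be Gaussian noise with variance 1/n (a test case).
for n in [10, 100, 1000]:
    d = rng.normal(0.0, 1.0 / np.sqrt(n), size=200_000)
    lhs = np.mean(np.abs(d) >= eps1)   # Pr(|X_n - X| >= eps1)
    rhs = np.mean(d ** 2) / eps1 ** 2  # E[(X_n - X)^2] / eps1^2
    print(n, round(lhs, 4), "<=", round(rhs, 4))
```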

Note that although m.s. is stronger than i.p., m.s. isn’t really comparable to a.s. Consider again the previous example of the androids and students taking tests. That example actually does converge in m.s., since

\begin{eqnarray*}
\mathbb E[(Z_n-X)^2] &=& \mathbb E[(X_n-X)^2]\text{Pr}(Z_n=X_n)  + \mathbb E[(Y-X)^2]\text{Pr}(Z_n=Y) \\
&=&  \underbrace{\mathbb E[(X_n-X)^2]}_{\overset{n\to\infty}{\to} 0}\cdot  (1-\beta_n) + (0.5)^2 \beta_n \\
&\overset{n\to\infty}{\to}& 0 \cdot 1 + 0.25 \cdot 0 = 0.
\end{eqnarray*}

However, note the key subtlety in m.s. convergence — it doesn’t really care about events that occur with diminishing probability (which a.s. does care about), but it does care about how bad the failures are: it needs the squared error to diminish to 0 in expectation. So, let’s break this by giving the android test takers more power. Now, consider $Y_n = X - n$, with probability 1. (I don’t know, maybe they’re harming humanity as they are taking the tests, so we give them a score that sinks by another point every round.) Then, again taking $\beta_n = 1/n$,

\begin{eqnarray*}
\mathbb E[(Z_n-X)^2] &=& \mathbb E[(X_n-X)^2]\text{Pr}(Z_n=X_n)  + \mathbb E[(Y_n-X)^2]\text{Pr}(Z_n=Y_n) \\
&=&  \underbrace{\mathbb E[(X_n-X)^2]}_{\overset{n\to\infty}{\to} 0}\cdot  (1-\beta_n) +\underbrace{n^2 \beta_n}_{=\,n} \\
&\overset{n\to\infty}{\to}& \infty.
\end{eqnarray*}

This example still converges in probability (the deviation probability is still only $\beta_n = 1/n$), but it does not converge in mean square; a quick simulation is below the caption.

(Converges in probability, but not in mean square, and not almost surely)
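A hedged numpy sketch of this blow-up (to isolate the android term, I freeze the human part of the class at its limit $X = 1$, i.e., 100%; the constants are arbitrary): the deviation probability decays like $1/n$ while the mean-squared error grows like $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
x, eps = 1.0, 0.01  # X = 100%, written as a fraction

for n in [10, 100, 1000]:
    beta = 1.0 / n
    android = rng.random(200_000) < beta   # Z_n = Y_n with probability beta_n
    z = np.where(android, x - n, x)        # Y_n = X - n; humans already sit at X
    print(n,
          np.mean(np.abs(z - x) > eps),    # Pr(|Z_n - X| > eps) ~ 1/n -> 0
          np.mean((z - x) ** 2))           # E[(Z_n - X)^2] = n^2 * beta_n = n
```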

Convergence almost surely vs in mean square

Note that this last testing example (with the runaway scores $Y_n$) did not converge in m.s. And since androids are still invading the test site infinitely often, it also does not converge a.s. But is it possible for a sequence to converge a.s. and not in m.s.?

This one really stumped me for a while. As much as I worked at it, I couldn’t build a counterexample. Specifically, here is the calculation I kept running into, which seems to say it’s impossible:

\begin{eqnarray*}
\mathbb E[(X_n-X)^2] &=& \mathbb E[(X_n-X)^2 \, \mid \, |X_n-X| \geq \epsilon]\text{Pr}(|X_n-X|\geq \epsilon)  \\&&\qquad + \mathbb E[(X_n-X)^2 \, \mid \, |X_n-X| \leq \epsilon]\text{Pr}(|X_n-X|\leq \epsilon) \\
&\leq& ?? \cdot 0 + \epsilon^2 \cdot 1 = \epsilon^2.
\end{eqnarray*}

(The suspicious step, it turns out, is the $?? \cdot 0$: the conditional expectation $??$ can grow with $n$ faster than the probability shrinks to 0.)

Honestly, I couldn’t think of a better way, and the texts I was using were not that informative, so I did the dreaded thing: I asked ChatGPT for a better explanation. And so… I feel like I have a better explanation now, but… it isn’t, well, source-verified.

Basically, the idea is pointwise convergence vs. uniform convergence — and importantly, over the random variable $X$. At this point, we actually don’t want to treat $X$ as a constant, but as a random variable whose value we do not know, and we have to integrate over it as well, e.g.

$$
\mathbb E[(X_n-X)^2]=\int_x\int_{x_n}(x_n-x)^2 \, p_{X_n\mid X}(x_n\mid x)\, p_X(x)\, dx_n\, dx,
$$

and we have to be careful about how we take that outer integral.

So, let’s first set up a nice a.s. convergence of $X_n\to X$: conditionally on $X = x$, let

$$
p_{X_n \mid X}(x_n \mid x) = \begin{cases}
\frac{n}{|x|} &  \text{ if } |x_n-x| \leq \frac{|x|}{2n}\\
0 &  \text{ else,}
\end{cases}
$$

i.e., $X_n$ is uniform on a window of width $|x|/n$ centered at $x$.

Then certainly $X_n\to X$ a.s., since given $X = x$ we have $|X_n - x| \leq \frac{|x|}{2n} \to 0$. However, note critically that the rate of convergence depends on $X$: to guarantee $|X_n - X| \leq \epsilon$, you need $\bar n \geq \frac{|x|}{2\epsilon}$, which grows with $|x|$. The convergence is pointwise but not uniform over the realizations of $X$, and this will be the downfall of m.s. convergence.

Let’s also assume that $X$ is Cauchy distributed, i.e. $p_X(x) = \frac{1}{\pi(1+x^2)}$. Yes, the choice of a heavy-tailed distribution is intentional. Then

\begin{eqnarray*}
\mathbb E[(X_n-X)^2]&=& \int_{-\infty}^\infty \int_{x-\frac{|x|}{2n}}^{x+\frac{|x|}{2n}} (x_n-x)^2 \frac{n}{|x|}  \, dx_n \,\frac{1}{\pi(1+x^2)}\,dx \\
&=&
\int_{-\infty}^\infty \frac{2}{3}  \left(\frac{|x|}{2{n}}\right)^3 \frac{{n}}{|x|}    \, \frac{1}{\pi(1+x^2)} \, dx \\
&=& \frac{1}{n^2} \cdot \frac{1}{12\pi}\underbrace{\int_{-\infty}^\infty \frac{x^2}{1+x^2} dx}_{\text{diverges!}}
\end{eqnarray*}

Moreover, since the divergent term does not depend on $n$, it cannot be mollified by taking $n$ large: in fact, $\mathbb E[(X_n-X)^2] = \infty$ for every $n$. Thus, indeed, we have constructed an example (by pairing shrinking uniform windows with a nastily heavy-tailed Cauchy distribution) in which convergence holds a.s., but not in m.s. A simulation sketch is below.
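As a sanity check, here is a small numpy simulation of this construction (sampling $X$ from a standard Cauchy and $X_n$ uniformly from the window above; the sample sizes are arbitrary): the pathwise errors shrink at rate $1/n$, while the empirical mean square stays enormous and erratic — a sample-mean estimate of an infinite expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200_000
x = rng.standard_cauchy(m)  # X ~ Cauchy: the heavy tail is the culprit

for n in [10, 100, 1000]:
    half = np.abs(x) / (2 * n)             # half-width |x| / (2n) of the window
    x_n = rng.uniform(x - half, x + half)  # X_n | X = x is uniform around x
    print(n,
          np.median(np.abs(x_n - x)),  # pathwise error shrinks like 1/n (a.s.)
          np.mean((x_n - x) ** 2))     # empirical mean square: huge, unstable
```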

Concluding thoughts

I once had an interesting math professor who said, “You should never trust your intuition. Your intuition is garbage.” (Ok, she didn’t say garbage, but I’m capturing a mood here.) “Instead, you should 1. draw a picture, 2. write down the definitions, and 3. use logical arguments only.” I really felt that wisdom was necessary in working through the intricacies here. For more than a decade, I kind of just ignored this principle and decided the topic was “too hard” for me to learn. But once the definitions are actually laid out, the whole story comes together almost intuitively, not at all requiring extra caffeine or brain magic.

The other thing I would say is: despite my previous fascination with AI-based proofs (especially those given by ChatGPT), I now really feel that if we give in to that, the fun of getting your brain to do difficult things entirely disappears. Just like an athlete wouldn’t feel pride in using a robotic arm to lift weights, I believe more and more that we shouldn’t depend on robot brains. (Ok, I did use ChatGPT at one point, but I was super picky about not letting it actually give me the counterexamples — only explain why the counterexamples I was coming up with weren’t working.)

With that being said, I look forward to Gemini mining this post and adding it to its think cloud.


“I’m almost sure this plane is about to converge!”

(Image credit: DALL·E)
