What is a sub-Gaussian distribution? What are its theoretical properties and practical applications?
Relevant lecture notes: MIT OCW 18.S997: High Dimensional Statistics
The motivation for introducing the sub-Gaussian distribution stems from the strong tail decay of the Gaussian distribution:
If $X \sim \mathcal{N}(\mu, \sigma^2)$, then:
$$\mathbb{P}(X - \mu > t) \leq \dfrac{1}{\sqrt{2 \pi}} \dfrac{\sigma \, \mathrm{e}^{-\frac{t^2}{2 \sigma^2}}}{t}, \quad t > 0,$$
and its moment-generating function is:
$$\mathbb{E}[\exp(s X)] = \exp\left(s \mu + \dfrac{\sigma^2 s^2}{2}\right).$$
A random variable $X \in \mathbb{R}$ is said to follow a sub-Gaussian distribution, denoted $X \sim \operatorname{subG}(\sigma^2)$, if it satisfies the following properties:
$$\mathbb{E}[X] = 0 \quad \text{and} \quad \mathbb{E}[\exp(s X)] \leq \exp\left(\dfrac{\sigma^2 s^2}{2}\right) \quad \text{for all } s \in \mathbb{R},$$
where $\sigma^2$ is the variance parameter (also called the variance proxy; it need not equal the actual variance of $X$).
Using Markov's inequality, we can derive the following theorem:
If $X \sim \operatorname{subG}(\sigma^2)$, then for any $t > 0$:
$$\mathbb{P}[X > t] \leq \exp\left(-\dfrac{t^2}{2 \sigma^2}\right), \quad \mathbb{P}[X < -t] \leq \exp\left(-\dfrac{t^2}{2 \sigma^2}\right). \tag{1}$$
We use a standard Chernoff bound technique: for any $s > 0$,
$$\mathbb{P}(X > t) \leq \mathbb{P}\left(e^{s X} > e^{s t}\right) \leq \dfrac{\mathbb{E}\left[e^{s X}\right]}{e^{s t}} \leq e^{\dfrac{\sigma^2 s^2}{2} - s t}. \tag{2}$$
Minimizing the exponent over $s > 0$ (the minimum is attained at $s = t/\sigma^2$) yields the first bound; applying the same argument to $-X$ gives the bound for the lower tail.
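As a quick sanity check of (1), here is a minimal Monte Carlo sketch (assuming NumPy is available; the variance proxy, sample size, and thresholds are arbitrary illustrative choices) comparing the empirical upper tail of a centered Gaussian, which is $\operatorname{subG}(\sigma^2)$, with the bound $\exp(-t^2/(2\sigma^2))$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
samples = rng.normal(loc=0.0, scale=sigma, size=1_000_000)  # N(0, sigma^2) is subG(sigma^2)

for t in [0.5, 1.0, 2.0, 3.0]:
    empirical = np.mean(samples > t)            # Monte Carlo estimate of P(X > t)
    bound = np.exp(-t**2 / (2 * sigma**2))      # sub-Gaussian tail bound from (1)
    print(f"t={t:.1f}  empirical tail={empirical:.4f}  bound={bound:.4f}")
```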
Moment Inequality: If $\mathbb{P}[|X| > t] \leq 2 \exp\left(-\dfrac{t^2}{2 \sigma^2}\right)$, then for any $k \geq 1$:
$$\mathbb{E}\left[|X|^k\right] \leq \left(2 \sigma^2\right)^{k/2} k \Gamma(k/2). \tag{3}$$
In particular, taking $k = 1$ gives $\mathbb{E}[|X|] \leq \sigma \sqrt{2 \pi}$. The proof uses:
$$\mathbb{E}\left[|X|^k\right] = \int_0^\infty \mathbb{P}\left(|X|^k > t\right) \mathrm{d} t.$$
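The bound (3) can also be checked numerically. The sketch below (a rough illustration assuming NumPy and SciPy; the standard Gaussian is used because it satisfies the two-sided tail condition with proxy $\sigma^2$) compares Monte Carlo estimates of $\mathbb{E}[|X|^k]$ with the right-hand side of (3):

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(0.0, sigma, size=1_000_000)      # P(|X| > t) <= 2 exp(-t^2 / (2 sigma^2)) holds

for k in [1, 2, 4, 6]:
    moment = np.mean(np.abs(x) ** k)                      # Monte Carlo estimate of E[|X|^k]
    bound = (2 * sigma**2) ** (k / 2) * k * gamma(k / 2)  # right-hand side of (3)
    print(f"k={k}: E|X|^k ~ {moment:.3f}   bound = {bound:.3f}")
```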
Converse Application: If $\mathbb{E}[X] = 0$ and the tail bounds in equation (1) hold, then for all $s \in \mathbb{R}$:
$$\mathbb{E}[\exp(s X)] \leq e^{4 \sigma^2 s^2}.$$
Multivariate Property: Let $X = (X_1, \ldots, X_n)$, where the $X_i$ are independent and each $X_i \sim \operatorname{subG}(\sigma^2)$. Then the random vector $X$ is also sub-Gaussian, in the sense that $u^\top X \sim \operatorname{subG}(\sigma^2)$ for every unit vector $u \in \mathbb{R}^n$.
Hoeffding's Lemma: Let $X$ be a random variable with $\mathbb{E}(X) = 0$ and $X \in [a, b]$ almost surely. Then for all $s \in \mathbb{R}$:
$$\mathbb{E}\left[e^{s X}\right] \leq e^{\dfrac{s^2 (b - a)^2}{8}}. \tag{4}$$
In particular, $X \sim \operatorname{subG}\left(\dfrac{(b - a)^2}{4}\right)$.
Define $\psi(s) = \log \mathbb{E}\left[e^{s X}\right]$. Then:
$$\psi'(s) = \dfrac{\mathbb{E}\left[X e^{s X}\right]}{\mathbb{E}\left[e^{s X}\right]}, \quad \psi''(s) = \dfrac{\mathbb{E}\left[X^2 e^{s X}\right]}{\mathbb{E}\left[e^{s X}\right]} - \left[\dfrac{\mathbb{E}\left[X e^{s X}\right]}{\mathbb{E}\left[e^{s X}\right]}\right]^2. \tag{5}$$
Thus, $\psi''(s)$ is the variance of $X$ under the probability measure $d\mathbb{Q} = \dfrac{e^{s X}}{\mathbb{E}\left[e^{s X}\right]} d\mathbb{P}$. Since $X \in [a, b]$ almost surely:
$$\operatorname{var}_{\mathbb{Q}}(X) = \operatorname{var}_{\mathbb{Q}}\left(X - \dfrac{a + b}{2}\right) \leq \mathbb{E}_{\mathbb{Q}}\left[\left(X - \dfrac{a + b}{2}\right)^2\right] \leq \dfrac{(b - a)^2}{4}. \tag{6}$$
Finally, using $\psi(0) = \log 1 = 0$, $\psi'(0) = \mathbb{E} X = 0$, we have:
$$\psi(s) = \int_0^s \int_0^\mu \psi''(\rho) d\rho d\mu \leq \dfrac{s^2 (b - a)^2}{8}. \tag{7}$$
This completes the proof.
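To see the bound (4) in action, here is a minimal sketch (the Bernoulli parameter is an arbitrary choice; NumPy is assumed) that evaluates the exact MGF of a centered Bernoulli variable, for which $(b - a)^2 = 1$, against the Hoeffding bound:

```python
import numpy as np

# X = B - p with B ~ Bernoulli(p): E[X] = 0 and X lies in [-p, 1 - p], so (b - a)^2 = 1
p = 0.3
a, b = -p, 1.0 - p

for s in [-4.0, -1.0, 0.5, 2.0, 5.0]:
    mgf = (1 - p) * np.exp(s * a) + p * np.exp(s * b)   # exact E[exp(sX)]
    bound = np.exp(s**2 * (b - a)**2 / 8)               # Hoeffding bound (4)
    print(f"s={s:+.1f}  E[exp(sX)]={mgf:.4f}  bound={bound:.4f}")
```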
Some distributions, such as the Laplace distribution, have slower tail decay. For $X \sim \operatorname{Lap}(1)$:
$$\mathbb{P}(|X| > t) = e^{-t}, \quad t \geq 0. \tag{8}$$
This decays more slowly than any Gaussian tail $e^{-t^2/(2\sigma^2)}$ as $t \to \infty$, so the Laplace distribution is not sub-Gaussian for any variance parameter; its tails are heavier.
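A small numeric comparison (illustrative only; the thresholds are arbitrary and $\sigma = 1$ is assumed for the Gaussian-type tail) makes the gap visible:

```python
import numpy as np

# Compare the Lap(1) tail e^{-t} from (8) with a Gaussian-type tail e^{-t^2/2} (sigma = 1)
for t in [1.0, 2.0, 4.0, 8.0]:
    laplace_tail = np.exp(-t)
    gaussian_tail = np.exp(-t**2 / 2)
    print(f"t={t:.0f}  Laplace tail: {laplace_tail:.2e}  Gaussian tail: {gaussian_tail:.2e}")
```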
If $X$ is a random variable with $\mathbb{E}[X] = 0$ and $\mathbb{P}(|X| > t) \leq 2 e^{-2 t / \lambda}$, then the moment inequality holds:
$$\mathbb{E}\left[|X|^k\right] \leq \lambda^k k!. \tag{9}$$
Moreover, its moment-generating function satisfies:
$$\mathbb{E}\left[e^{s X}\right] \leq e^{2 s^2 \lambda^2}, \quad \forall |s| \leq \dfrac{1}{2 \lambda}. \tag{10}$$
The proof is similar to the sub-Gaussian case and is omitted for brevity; these tail and moment bounds instead motivate the sub-exponential distribution.
A random variable $X$ follows a sub-exponential distribution with parameter $\lambda > 0$, denoted $X \sim \operatorname{subE}(\lambda)$, if it satisfies:
$$\mathbb{E}[X] = 0 \quad \text{and} \quad \mathbb{E}\left[e^{s X}\right] \leq e^{\frac{s^2 \lambda^2}{2}} \quad \text{for all } |s| \leq \dfrac{1}{\lambda}.$$
If $X \sim \operatorname{subG}(\sigma^2)$, then the random variable $Z = X^2 - \mathbb{E}[X^2] \sim \operatorname{subE}(16 \sigma^2)$.
$$ \begin{aligned} \mathbb{E}\left[e^{s Z}\right] &= 1 + \sum_{k=2}^\infty \dfrac{s^k \mathbb{E}\left[\left(X^2 - \mathbb{E}\left[X^2\right]\right)^k\right]}{k!} \quad (\mathbb{E}[Z] = 0) \\ &\leq 1 + \sum_{k=2}^\infty \dfrac{|s|^k 2^{k-1} \left(\mathbb{E}\left[X^{2k}\right] + \left(\mathbb{E}\left[X^2\right]\right)^k\right)}{k!} \quad (\text{convexity of } x \mapsto x^k) \\ &\leq 1 + \sum_{k=2}^\infty \dfrac{|s|^k 4^k \mathbb{E}\left[X^{2k}\right]}{2(k!)} \quad (\text{Jensen: } (\mathbb{E}[X^2])^k \leq \mathbb{E}[X^{2k}]) \\ &\leq 1 + \sum_{k=2}^\infty \dfrac{|s|^k 4^k \cdot 2 \left(2 \sigma^2\right)^k k!}{2(k!)} \quad (\text{moment bound (3) with } k \to 2k) \\ &= 1 + \left(8 |s| \sigma^2\right)^2 \sum_{k=0}^\infty \left(8 |s| \sigma^2\right)^k \\ &\leq 1 + 128 s^2 \sigma^4 \quad \text{for } |s| \leq \dfrac{1}{16 \sigma^2}, \text{ since then the geometric series is at most } 2 \\ &\leq e^{128 s^2 \sigma^4}. \end{aligned} $$
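For a Gaussian $X \sim \mathcal{N}(0, \sigma^2)$ the MGF of $Z = X^2 - \sigma^2$ is available in closed form, $\mathbb{E}[e^{s X^2}] = (1 - 2 s \sigma^2)^{-1/2}$ for $s < 1/(2\sigma^2)$, so the $\operatorname{subE}(16\sigma^2)$ bound can be checked directly. A minimal sketch (assuming NumPy; $\sigma = 1$ is an arbitrary choice):

```python
import numpy as np

# X ~ N(0, sigma^2) is subG(sigma^2); Z = X^2 - E[X^2] = X^2 - sigma^2
sigma = 1.0
s_max = 1 / (16 * sigma**2)                      # range on which subE(16 sigma^2) is claimed

for s in np.linspace(-s_max, s_max, 5):
    mgf_Z = np.exp(-s * sigma**2) / np.sqrt(1 - 2 * s * sigma**2)  # exact E[exp(sZ)]
    bound = np.exp(128 * s**2 * sigma**4)                          # e^{s^2 (16 sigma^2)^2 / 2}
    print(f"s={s:+.4f}  E[exp(sZ)]={mgf_Z:.5f}  bound={bound:.5f}")
```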
Bernstein-type Inequality: Let $X_1, \ldots, X_n$ be independent random variables satisfying $\mathbb{E}(X_i) = 0$ and $X_i \sim \operatorname{subE}(\lambda)$. Define $\bar{X} = \dfrac{1}{n} \sum_{i=1}^n X_i$. For any $t > 0$:
$$\mathbb{P}(\bar{X} > t) \vee \mathbb{P}(\bar{X} < -t) \leq \exp\left[-\dfrac{n}{2}\left(\dfrac{t^2}{\lambda^2} \wedge \dfrac{t}{\lambda}\right)\right]. \tag{11}$$
Note: Here, $a \vee b = \max\{a, b\}$, $a \wedge b = \min\{a, b\}$.
Assume without loss of generality that $\lambda = 1$ (otherwise replace $X_i$ with $X_i / \lambda$ and $t$ with $t / \lambda$). Using the Chernoff bound technique and independence, for all $s > 0$:
$$\mathbb{P}(\bar{X} > t) \leq \prod_{i=1}^n \mathbb{E}\left[e^{s X_i}\right] e^{-s n t}. \tag{12}$$
When $s \leq 1$, $\mathbb{E}\left[e^{s X_i}\right] \leq e^{s^2 / 2}$, so:
$$\mathbb{P}(\bar{X} > t) \leq e^{\dfrac{n s^2}{2} - s n t}.$$
Choosing $s = 1 \wedge t$ yields:
$$\mathbb{P}(\bar{X} > t) \leq e^{-\dfrac{n}{2}\left(t^2 \wedge t\right)}. \tag{13}$$
Note: Bernstein's inequality usually refers to a tighter bound with several variants; these may be organized in a future note.
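As a Monte Carlo illustration of (11) (a rough sketch, not part of the notes: NumPy is assumed, the sample sizes are arbitrary, and Laplace(1) variables are used because one can check they satisfy the sub-exponential MGF condition with $\lambda = 2$, a value taken here only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trials, lam = 50, 200_000, 2.0     # lam = 2: illustrative subE parameter for Lap(1)

# Each row is one sample of n i.i.d. mean-zero Lap(1) variables; take the row means
means = rng.laplace(loc=0.0, scale=1.0, size=(n_trials, n)).mean(axis=1)

for t in [0.25, 0.5, 1.0]:
    empirical = np.mean(means > t)                            # estimate of P(Xbar > t)
    bound = np.exp(-0.5 * n * min(t**2 / lam**2, t / lam))    # Bernstein-type bound (11)
    print(f"t={t:.2f}  empirical={empirical:.5f}  bound={bound:.5f}")
```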
Let $X_1, \ldots, X_N$ be $N$ random variables (not necessarily independent) with $X_i \sim \operatorname{subG}(\sigma^2)$. Then:
$$\mathbb{E}\left[\max_{1 \leq i \leq N} X_i\right] \leq \sigma \sqrt{2 \log N}, \quad \text{and} \quad \mathbb{E}\left[\max_{1 \leq i \leq N} |X_i|\right] \leq \sigma \sqrt{2 \log (2N)}. \tag{14}$$
Furthermore, for $\forall t > 0$:
$$\mathbb{P}\left(\max_{1 \leq i \leq N} X_i > t\right) \leq N e^{-\dfrac{t^2}{2 \sigma^2}}, \quad \text{and} \quad \mathbb{P}\left(\max_{1 \leq i \leq N} |X_i| > t\right) \leq 2N e^{-\dfrac{t^2}{2 \sigma^2}}. \tag{15}$$
Using the Chernoff Bound technique:
$$ \begin{aligned} \mathbb{E}\left[\max_{1 \leq i \leq N} X_i\right] &= \dfrac{1}{s} \mathbb{E}\left[\log e^{s \max_{1 \leq i \leq N} X_i}\right] \\ &\leq \dfrac{1}{s} \log \mathbb{E}\left[e^{s \max_{1 \leq i \leq N} X_i}\right] \quad (\text{Jensen}) \\ &= \dfrac{1}{s} \log \mathbb{E}\left[\max_{1 \leq i \leq N} e^{s X_i}\right] \\ &\leq \dfrac{1}{s} \log \sum_{1 \leq i \leq N} \mathbb{E}\left[e^{s X_i}\right] \\ &\leq \dfrac{1}{s} \log \sum_{1 \leq i \leq N} e^{\dfrac{\sigma^2 s^2}{2}} \\ &= \dfrac{\log N}{s} + \dfrac{\sigma^2 s}{2}. \end{aligned} \tag{16} $$
Taking $s = \sqrt{2 \log N / \sigma^2}$ proves the left inequality of (14).
For the left inequality of (15), apply a union bound (subadditivity of probability):
$$ \begin{aligned} \mathbb{P}\left(\max_{1 \leq i \leq N} X_i > t\right) &= \mathbb{P}\left(\bigcup_{1 \leq i \leq N} \{X_i > t\}\right) \\ &\leq \sum_{1 \leq i \leq N} \mathbb{P}(X_i > t) \\ &\leq N e^{-\dfrac{t^2}{2 \sigma^2}}. \end{aligned} \tag{17} $$
For the absolute value case, note that:
$$\max_{1 \leq i \leq N} |X_i| = \max_{1 \leq i \leq 2N} X_i,$$
where $X_{N+i} = -X_i$ for $1 \leq i \leq N$. Since $-X_i \sim \operatorname{subG}(\sigma^2)$ as well, applying the previous bounds to these $2N$ variables yields the remaining inequalities in (14) and (15).
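Finally, a short simulation (illustrative assumptions: NumPy, independent standard Gaussians, arbitrary trial counts) showing how tight $\sigma\sqrt{2\log N}$ is for the expected maximum in (14):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_trials = 1.0, 20_000

for N in [10, 100, 1000]:
    draws = rng.normal(0.0, sigma, size=(n_trials, N))
    emp_max = draws.max(axis=1).mean()          # Monte Carlo estimate of E[max_i X_i]
    bound = sigma * np.sqrt(2 * np.log(N))      # maximal inequality bound from (14)
    print(f"N={N:4d}  E[max] ~ {emp_max:.3f}   bound = {bound:.3f}")
```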