
Median of Means Method: A Robust Approach to Estimating the Mean

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

Statistics is a field that empowers us to transform raw data into insightful information. Among the most commonly used techniques in statistics is the calculation of the mean, or the average. However, when we aim to minimize the impact of outliers or extreme values, we can employ the Median of Means method. This statistical approach provides a more robust estimate of a dataset’s central tendency. Let’s delve deeper into understanding this method.

Classical Mean Estimation

Before we explore the Median of Means method, it’s crucial to comprehend the classical approach to estimating the mean of a distribution. The mean, a fundamental concept in statistics, represents the central or ‘average’ value of a dataset. The classical mean estimator, often symbolized as $\hat{\mu}$, is employed to estimate the mean $\mu$ of a distribution.

Definition (Estimator $\hat{\mu}$) Given a random sample $x_1, \dots, x_N$ drawn from a distribution with an unknown mean $\mu$ and a known standard deviation $\sigma$, the estimator of the mean, $\hat{\mu}$, is calculated as the sum of the sampled values divided by the number of samples. This can be expressed mathematically as:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N}x_i,$$

where $N$ is the total number of samples and $x_i$ represents each individual sample from the distribution.

This formula provides us with the arithmetic mean of the sampled values, which serves as our best estimate of the true mean $\mu$ of the distribution.
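As a quick illustration, here is a minimal Python sketch of this estimator (the distribution parameters and sample size are hypothetical choices made for the example):

```python
import numpy as np

def classical_mean(samples):
    """Classical mean estimator: the sum of the samples divided by their count."""
    samples = np.asarray(samples, dtype=float)
    return samples.sum() / len(samples)

# Example: estimate the mean of a normal distribution from N = 1000 draws.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)
print(classical_mean(x))  # should land close to the true mean 2.0
```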

The next question that arises is: how effective is this estimator? The following bound shows that the probability of a large deviation decays at the rate $O(\frac{1}{t^2})$.

Proposition 1 (Convergence of Classical Mean Estimation) Let $(X_i)_{i=1}^N$ be a sequence of independent and identically distributed random variables drawn from a population with a true mean $\mu$ and a standard deviation $\sigma$, and let $\hat{\mu} = \frac{1}{N}\sum_{i=1}^N X_i$ be the sample mean. Then for any $t > 0$, the deviation between the sample mean $\hat{\mu}$ and the true mean $\mu$ satisfies

$$P\left( \left| \hat{\mu} - \mu \right| > \frac{\sigma}{\sqrt{N}}t \right) \leq \frac{1}{t^2}.$$

This proposition’s proof is a direct application of Chebyshev’s inequality to the random variable $Y = \hat{\mu} - \mu$, which has mean $0$ and variance $\frac{\sigma^2}{N}$.

The terms we’ll be using are defined as follows:

  • $\mu$ is the true mean of a population.
  • $\hat{\mu}$ is the sample mean, given by $\frac{1}{N}\sum_{i=1}^{N} x_i$.
  • $\sigma$ is the standard deviation of the population.
  • $t$ is a value greater than $0$.
  • $N$ is the number of samples.

Chebyshev’s inequality states that for any random variable with a finite mean $\mu$ and finite variance $\sigma^2$, the probability that the random variable deviates from its mean by more than $k$ standard deviations is at most $\frac{1}{k^2}$. Formally, it is stated as $P\{ |X - \mu| \geq k\sigma \} \leq \frac{1}{k^2}$. For example, taking $k = 3$, at most $\frac{1}{9} \approx 11\%$ of the probability mass can lie more than three standard deviations from the mean.

We aim to estimate the mean $\mu$ of a distribution using $N$ samples and provide an upper bound for the probability that the absolute difference between the estimated mean $\hat{\mu}$ and the true mean $\mu$ is greater than $\frac{\sigma}{\sqrt{N}}t$.

First, we define a new random variable $Y = \frac{1}{N}\sum_{i=1}^{N} x_i - \mu$, which is the difference between the sample mean and the true mean. The mean of $Y$ is $\mathbb{E}[Y] = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] - \mu = \mu - \mu = 0$.

The variance of $Y$ is $\mathrm{Var}[Y] = \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N^2}\mathrm{Var}\left[\sum_{i=1}^{N} x_i\right] = \frac{1}{N^2} \cdot N\sigma^2 = \frac{\sigma^2}{N}$, where we used the fact that the variance of a sum of independent random variables is the sum of their variances.

Applying Chebyshev’s inequality to $Y$ with $k = t$, and noting that the standard deviation of $Y$ is $\frac{\sigma}{\sqrt{N}}$:

$$P\left\{ |Y - 0| \geq t\frac{\sigma}{\sqrt{N}} \right\} \leq \frac{1}{t^2}.$$

This is equivalent to:

$$P\left\{ \left|\frac{1}{N}\sum_{i=1}^{N} x_i - \mu\right| \geq t\frac{\sigma}{\sqrt{N}} \right\} \leq \frac{1}{t^2}$$

So, we have shown that the probability that the absolute difference between the estimated mean $\hat{\mu}$ and the true mean $\mu$ is greater than $\frac{\sigma}{\sqrt{N}}t$ is at most $\frac{1}{t^2}$, as required.
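To see this bound in action, here is a minimal Python simulation (the standard normal population, sample size, and value of $t$ are hypothetical choices) comparing the empirical deviation probability with the Chebyshev bound $\frac{1}{t^2}$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, t, trials = 100, 3.0, 100_000
mu, sigma = 0.0, 1.0  # standard normal population

# Draw `trials` independent sample means, each computed from N observations.
sample_means = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)

# Empirical probability that |mu_hat - mu| exceeds t * sigma / sqrt(N).
empirical = np.mean(np.abs(sample_means - mu) > t * sigma / np.sqrt(N))
print(f"empirical tail: {empirical:.4f}, Chebyshev bound: {1 / t**2:.4f}")
```

For a normal population the empirical tail (about $0.0027$ here) sits far below the bound $1/9 \approx 0.11$; Chebyshev’s inequality is loose, but it holds for every distribution with finite variance.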

Understanding the Median of Means Method

The Median of Means (MoM) method is a statistical technique that merges the principles of both the mean and the median to yield a more robust measure of central tendency. It is particularly beneficial when dealing with large datasets that may contain outliers or extreme values that can distort the mean.

The MoM method operates by dividing the dataset into several smaller subsets, calculating the mean of each subset, and then determining the median of these means. This approach can offer a more accurate representation of the central tendency of the data, especially in datasets with extreme values.

Calculating the Median of Means

As we’ve discussed, the Median of Means (MoM) method is a potent statistical technique that provides a robust measure of central tendency, especially when dealing with large datasets or those with outliers. Let’s delve into a more rigorous definition of the MoM method.

In this scenario, we have a total of $N$ samples that we’ll divide into $K$ groups. Each group contains $M$ samples, such that $N = K \cdot M$.

Definition (Median of Means Estimator) Given a random sample $x_1, \dots, x_N$ drawn from a distribution, we first divide these samples into $K$ groups, each containing $M$ samples. For each group $k$, we calculate the mean of the samples in that group, denoted as $\hat{\mu}_k$. Mathematically, this can be expressed as:

$$\hat{\mu}_k = \frac{1}{M}\sum_{i=1}^{M}x_{ki},$$

where $M$ is the number of samples in each group and $x_{ki}$ represents the $i$-th sample in group $k$. The Median of Means estimator, denoted as $\tilde{\mu}$, is then calculated as the median of these means:

$$\tilde{\mu} = \text{Median}(\hat{\mu}_1, \dots, \hat{\mu}_K)$$

The median is the middle value when the means are arranged in ascending order. If there is an even number of means, the median is the average of the two middle values.

The steps involved in the MoM method are:

  1. Divide the Dataset: We begin by dividing the entire dataset into $K$ groups, each containing $M$ samples. This division is such that the total number of samples $N$ equals $K \cdot M$. The choice of $K$ can depend on the size of the data and the level of precision desired.

  2. Calculate the Mean of Each Subset: For each group $k$, we calculate the mean $\hat{\mu}_k$ of the samples in that group. This is done by summing all the values in the group and dividing by the number of values $M$. Mathematically, this is represented as:

    $$\hat{\mu}_k = \frac{1}{M}\sum_{i=1}^{M}x_{ki}$$

  3. Calculate the Median of the Means: After calculating the mean of each group, we have a new set of data: the means $\hat{\mu}_1, \dots, \hat{\mu}_K$. The Median of Means estimator $\tilde{\mu}$ is then the median of these means, computed exactly as described in the definition above. A minimal implementation of these three steps is sketched below.
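The following Python sketch puts the three steps together (the choice of $K$ and the sample data are hypothetical; samples that do not fill a complete group are simply discarded):

```python
import numpy as np

def median_of_means(samples, K):
    """Median of Means: split into K groups, average each group,
    and return the median of the K group means."""
    samples = np.asarray(samples, dtype=float)
    M = len(samples) // K                    # step 1: samples per group
    grouped = samples[:K * M].reshape(K, M)  # drop leftovers, form K x M groups
    group_means = grouped.mean(axis=1)       # step 2: mean of each group
    return np.median(group_means)            # step 3: median of the means

# Example: a sample of 1000 points contaminated by one extreme outlier.
rng = np.random.default_rng(2)
x = np.append(rng.normal(0.0, 1.0, 999), 1e6)
print(x.mean())                   # ruined by the outlier (roughly 1000)
print(median_of_means(x, K=10))   # stays close to the true mean 0
```

Only the single group containing the outlier is corrupted; the median simply ignores that one inflated group mean.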

Advantages of the Median of Means Method

The MoM method offers several advantages over traditional measures of central tendency, such as the mean or median alone. The MoM method is less sensitive to outliers or extreme values, which can significantly distort the mean. By calculating the mean of several subsets, the impact of any outliers is reduced. This is formally stated as follows:

Proposition 2 (Convergence of Median of Means) Let $(X_i)_{i=1}^N$ be a sequence of independent and identically distributed random variables drawn from a population with a true mean $\mu$ and a standard deviation $\sigma$, and let $\tilde{\mu}$ be the Median of Means estimator calculated from $K$ groups, each of size $M$. Then for any $t > 0$, with a suitable choice of $K$ (proportional to $t^2$, as the proof below makes precise), the probability that the absolute difference between $\tilde{\mu}$ and the true mean $\mu$ is greater than $\frac{\sigma}{\sqrt{N}}t$ is at most $2\exp(-ct^2)$ for some constant $c > 0$; namely,

$$P\left\{\left| \tilde{\mu} - \mu \right| \ge \frac{\sigma}{\sqrt{N}}t\right\} = O\left(\exp(-ct^2)\right)$$

Now, let’s prove this.

We begin the proof by applying Chebyshev’s inequality.

Step 1: Using Chebyshev’s Inequality

Chebyshev’s inequality states that for any random variable with a finite expected value $\mu$ and finite non-zero variance $\sigma^2$, the probability that the random variable deviates from its mean by more than $t$ is at most $\frac{\sigma^2}{t^2}$.

In our case, we have a random sample $x_1, \dots, x_N$ drawn from a distribution with mean $\mu$ and variance $\sigma^2$. We divide these samples into $K$ groups, each containing $M$ samples. For each group $k$, we calculate the mean of the samples in that group, denoted as $\hat{\mu}_k$.

The variance of each $\hat{\mu}_k$ is $\sigma^2/M$, because the variance of the mean of $M$ independent samples from a distribution with variance $\sigma^2$ is $\sigma^2/M$.

Using Chebyshev’s inequality (the one-sided event below is contained in the corresponding two-sided deviation event), we have:

$$P\left\{\hat{\mu}_k - \mu \ge \frac{t\sigma}{\sqrt{N}}\right\} \le \frac{\sigma^2/M}{(t\sigma/\sqrt{N})^2}$$

Simplifying the right-hand side, we get:

$$P\left\{\hat{\mu}_k - \mu \ge \frac{t\sigma}{\sqrt{N}}\right\} \le \frac{N}{Mt^2} = \frac{K}{t^2}$$

Step 2: Using Median Property

Now we want to show that if $\tilde{\mu} \ge \mu + t\sigma/\sqrt{N}$, then at least half of the $\hat{\mu}_k$’s are $\ge \mu + t\sigma/\sqrt{N}$.

This is a property of the median: if the median of a set of numbers is at least some value, then at least half of the numbers in the set are at least that value.

Step 3: Defining Bernoulli Random Variables

Let’s define $X_k$ as a Bernoulli random variable, where:

  • $X_k = 1$ if $\hat{\mu}_k \ge \mu + \frac{t\sigma}{\sqrt{N}}$
  • $X_k = 0$ otherwise

We have $K$ such variables, $X_1, X_2, \dots, X_K$, each corresponding to one group; they are independent because the groups are disjoint.

Step 4: Bounding the Expectation of the Bernoulli Variables

From Step 1, we know that $P\{X_k = 1\} \le \frac{K}{t^2}$. Since $X_k$ is a Bernoulli random variable, its expectation equals this probability, so $\mathbb{E}[X_k] = P\{X_k = 1\} \le \frac{K}{t^2}$.

Step 5: Using Hoeffding’s Inequality

We aim to bound the probability that the sum of the $X_k$’s is at least $K/2$, which is equivalent to saying that the average of the $X_k$’s is at least $1/2$.

We can use Hoeffding’s inequality here, which states that for independent random variables taking values in $[0,1]$ with common mean $\mathbb{E}[X_k]$, and for any $a \ge \mathbb{E}[X_k]$,

$$P\left\{\frac{1}{K}\sum_{k=1}^{K}X_k \ge a\right\} \le \exp\left(-2K\left(a - \mathbb{E}[X_k]\right)^2\right)$$

Substituting the upper bound $\mathbb{E}[X_k] \le \frac{K}{t^2}$ and taking $a = 1/2$ (which requires $\frac{K}{t^2} < \frac{1}{2}$, i.e., $t^2 > 2K$), we get:

$$P\left\{\frac{1}{K}\sum_{k=1}^{K}X_k \ge \frac{1}{2}\right\} \le \exp\left(-2K\left(\frac{1}{2} - \frac{K}{t^2}\right)^2\right)$$

Simplifying, we get:

$$P\left\{\frac{1}{K}\sum_{k=1}^{K}X_k \ge \frac{1}{2}\right\} \le \exp\left(-\frac{K(t^2 - 2K)^2}{2t^4}\right)$$

The exponent is negative precisely when $t^2 > 2K$, as assumed above. If we now choose the number of groups to grow with $t^2$, say $K = t^2/4$, then $\frac{K}{t^2} = \frac{1}{4}$ and the exponent becomes $-2K\left(\frac{1}{2} - \frac{1}{4}\right)^2 = -\frac{K}{8} = -\frac{t^2}{32}$; that is, the bound takes the form $\exp(-ct^2)$ with $c = \frac{1}{32}$.
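As a concrete check of this choice (using the hypothetical values $t = 20$ and $K = t^2/4 = 100$), the bound evaluates to

$$P\left\{\frac{1}{K}\sum_{k=1}^{K}X_k \ge \frac{1}{2}\right\} \le \exp\left(-2 \cdot 100 \cdot \left(\frac{1}{2} - \frac{1}{4}\right)^2\right) = \exp(-12.5) \approx 3.7 \times 10^{-6},$$

which is far smaller than the Chebyshev bound $\frac{1}{t^2} = \frac{1}{400} = 2.5 \times 10^{-3}$ at the same deviation level.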

In Big O notation, this tail bound can be represented as:

$$P\left\{\frac{1}{K}\sum_{k=1}^{K}X_k \ge \frac{1}{2}\right\} = O\left(\exp(-ct^2)\right)$$

This signifies that the probability decreases exponentially as $t^2$ increases, where $c$ is a constant that depends on how the number of groups $K$ is chosen relative to $t$. By the median property from Step 2, this bounds the upper tail $P\{\tilde{\mu} \ge \mu + t\sigma/\sqrt{N}\}$; an identical argument applied to the lower tail $P\{\tilde{\mu} \le \mu - t\sigma/\sqrt{N}\}$ yields the same bound, and combining the two gives the factor of $2$ in Proposition 2.

Upon comparing the results from Proposition 1 ($P\left( \left| \hat{\mu} - \mu \right| > \frac{\sigma}{\sqrt{N}}t \right) \leq \frac{1}{t^2}$) and Proposition 2 ($P\left\{\left| \tilde{\mu} - \mu \right| \ge \frac{\sigma}{\sqrt{N}}t\right\} = O\left(\exp(-ct^2)\right)$), we can discern that the tail bounds differ significantly. The classical mean estimator’s deviation probability decays at the rate $O(\frac{1}{t^2})$. In contrast, the Median of Means estimator’s decays at the rate $O\left(\exp(-ct^2)\right)$, which is markedly faster for large $t$.
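To see this difference in practice, here is a minimal Python sketch (the population, a heavy-tailed Lomax/Pareto-II distribution with tail index $a = 2.1$, and the values of $N$, $K$, and the trial count are all hypothetical choices) comparing the worst-case error of the two estimators over repeated trials:

```python
import numpy as np

rng = np.random.default_rng(3)
a, N, K, trials = 2.1, 10_000, 20, 2_000
true_mean = 1 / (a - 1)  # mean of the Lomax(a) distribution for a > 1

def median_of_means(samples, K):
    """Split into K groups, average each group, take the median of the means."""
    M = len(samples) // K
    return np.median(samples[:K * M].reshape(K, M).mean(axis=1))

errors_mean, errors_mom = [], []
for _ in range(trials):
    x = rng.pareto(a, N)  # heavy-tailed sample with finite variance
    errors_mean.append(abs(x.mean() - true_mean))
    errors_mom.append(abs(median_of_means(x, K) - true_mean))

# MoM's worst-case error is typically noticeably smaller than the mean's,
# because a single extreme draw can only corrupt one group.
print(f"classical mean:  worst error = {max(errors_mean):.3f}")
print(f"median of means: worst error = {max(errors_mom):.3f}")
```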

In conclusion, the Median of Means method is a powerful tool in statistical analysis, providing a robust and accurate measure of central tendency, especially in large datasets with outliers. By understanding and implementing this method, you can extract deeper insights from your data and make more informed decisions.

Reference

[1] Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.