
Median of Means Method: A Robust Approach to Estimating the Mean

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

Statistics is a field that empowers us to transform raw data into insightful information. Among the most commonly used techniques in statistics is the calculation of the mean, or the average. However, when we aim to minimize the impact of outliers or extreme values, we can employ the Median of Means method. This statistical approach provides a more robust estimate of a dataset’s central tendency. Let’s delve deeper into understanding this method.

Classical Mean Estimation

Before we explore the Median of Means method, it’s crucial to comprehend the classical approach to estimating the mean of a distribution. The mean, a fundamental concept in statistics, represents the central or ‘average’ value of a dataset. The classical mean estimator, often symbolized as $\hat{\mu}$, is employed to estimate the mean $\mu$ of a distribution.

Definition (Estimator $\hat{\mu}$) Given a random sample $x_1, \dots, x_N$ drawn from a distribution with an unknown mean $\mu$ and a known standard deviation $\sigma$, the estimator of the mean, $\hat{\mu}$, is calculated as the sum of the sampled values divided by the number of samples. This can be expressed mathematically as:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N}x_i,$$

where $N$ is the total number of samples and $x_i$ represents each individual sample from the distribution.

This formula provides us with the arithmetic mean of the sampled values, which serves as our best estimate of the true mean $\mu$ of the distribution.
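As a quick illustration, here is a minimal Python sketch of this estimator (the distribution parameters and sample size are hypothetical choices made for the example):

```python
import numpy as np

def classical_mean(samples):
    """Classical mean estimator: the sum of the samples divided by their count."""
    samples = np.asarray(samples, dtype=float)
    return samples.sum() / len(samples)

# Example: estimate the mean of a normal distribution from N = 1000 draws.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)
print(classical_mean(x))  # should land close to the true mean 2.0
```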

The next question that arises is: how effective is this estimator? The following bound shows that the probability of a large deviation decays at the rate $O(\frac{1}{t^2})$.

Proposition 1 (Convergence of Classical Mean Estimation) Let $(X_i)_{i=1}^N$ be a sequence of independent and identically distributed random variables drawn from a population with a true mean $\mu$ and a standard deviation $\sigma$, and let $\hat{\mu} = \frac{1}{N}\sum_{i=1}^N X_i$ be the sample mean. Then for any $t > 0$, the deviation between the sample mean $\hat{\mu}$ and the true mean $\mu$ satisfies

$$P\left( \left| \hat{\mu} - \mu \right| > \frac{\sigma}{\sqrt{N}}t \right) \leq \frac{1}{t^2}.$$

This proposition’s proof is a direct application of Chebyshev’s inequality to the random variable $Y = \hat{\mu} - \mu$, which has mean $0$ and variance $\frac{\sigma^2}{N}$.

The terms we’ll be using are defined as follows:

  • $\mu$ is the true mean of a population.
  • $\hat{\mu}$ is the sample mean, given by $\frac{1}{N}\sum_{i=1}^{N} x_i$.
  • $\sigma$ is the standard deviation of the population.
  • $t$ is a value greater than $0$.
  • $N$ is the number of samples.

Chebyshev’s inequality states that for any random variable with a finite mean $\mu$ and finite variance $\sigma^2$, the probability that the random variable deviates from its mean by more than $k$ standard deviations is at most $\frac{1}{k^2}$. Formally, it is stated as $P\{ |X - \mu| \geq k\sigma \} \leq \frac{1}{k^2}$. For example, taking $k = 3$, at most $\frac{1}{9} \approx 11\%$ of the probability mass can lie more than three standard deviations from the mean.

We aim to estimate the mean $\mu$ of a distribution using $N$ samples and provide an upper bound for the probability that the absolute difference between the estimated mean $\hat{\mu}$ and the true mean $\mu$ is greater than $\frac{\sigma}{\sqrt{N}}t$.

First, we define a new random variable $Y = \frac{1}{N}\sum_{i=1}^{N} x_i - \mu$, which is the difference between the sample mean and the true mean. The mean of $Y$ is $\mathbb{E}[Y] = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] - \mu = \mu - \mu = 0$.

The variance of $Y$ is $\mathrm{Var}[Y] = \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N^2}\mathrm{Var}\left[\sum_{i=1}^{N} x_i\right] = \frac{1}{N^2} \cdot N\sigma^2 = \frac{\sigma^2}{N}$, where we used the fact that the variance of a sum of independent random variables is the sum of their variances.

Applying Chebyshev’s inequality to $Y$ with $k = t$, and noting that the standard deviation of $Y$ is $\frac{\sigma}{\sqrt{N}}$:

$$P\left\{ |Y - 0| \geq t\frac{\sigma}{\sqrt{N}} \right\} \leq \frac{1}{t^2}.$$

This is equivalent to:

$$P\left\{ \left|\frac{1}{N}\sum_{i=1}^{N} x_i - \mu\right| \geq t\frac{\sigma}{\sqrt{N}} \right\} \leq \frac{1}{t^2}$$

So, we have shown that the probability that the absolute difference between the estimated mean $\hat{\mu}$ and the true mean $\mu$ is greater than $\frac{\sigma}{\sqrt{N}}t$ is at most $\frac{1}{t^2}$, as required.
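To see this bound in action, here is a minimal Python simulation (the standard normal population, sample size, and value of $t$ are hypothetical choices) comparing the empirical deviation probability with the Chebyshev bound $\frac{1}{t^2}$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, t, trials = 100, 3.0, 100_000
mu, sigma = 0.0, 1.0  # standard normal population

# Draw `trials` independent sample means, each computed from N observations.
sample_means = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)

# Empirical probability that |mu_hat - mu| exceeds t * sigma / sqrt(N).
empirical = np.mean(np.abs(sample_means - mu) > t * sigma / np.sqrt(N))
print(f"empirical tail: {empirical:.4f}, Chebyshev bound: {1 / t**2:.4f}")
```

For a normal population the empirical tail (about $0.0027$ here) sits far below the bound $1/9 \approx 0.11$; Chebyshev’s inequality is loose, but it holds for every distribution with finite variance.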

Understanding the Median of Means Method

The Median of Means (MoM) method is a statistical technique that merges the principles of both the mean and the median to yield a more robust measure of central tendency. It is particularly beneficial when dealing with large datasets that may contain outliers or extreme values that can distort the mean.

The MoM method operates by dividing the dataset into several smaller subsets, calculating the mean of each subset, and then determining the median of these means. This approach can offer a more accurate representation of the central tendency of the data, especially in datasets with extreme values.

Calculating the Median of Means

As we’ve discussed, the Median of Means (MoM) method is a potent statistical technique that provides a robust measure of central tendency, especially when dealing with large datasets or those with outliers. Let’s delve into a more rigorous definition of the MoM method.

In this scenario, we have a total of $N$ samples that we’ll divide into $K$ groups. Each group contains $M$ samples, such that $N = K \cdot M$.

Definition (Median of Means Estimator) Given a random sample $x_1, \dots, x_N$ drawn from a distribution, we first divide these samples into $K$ groups, each containing $M$ samples. For each group $k$, we calculate the mean of the samples in that group, denoted as $\hat{\mu}_k$. Mathematically, this can be expressed as:

$$\hat{\mu}_k = \frac{1}{M}\sum_{i=1}^{M}x_{ki},$$

where $M$ is the number of samples in each group and $x_{ki}$ represents the $i$-th sample in group $k$. The Median of Means estimator, denoted as $\tilde{\mu}$, is then calculated as the median of these means:

$$\tilde{\mu} = \text{Median}(\hat{\mu}_1, \dots, \hat{\mu}_K)$$

The median is the middle value when the means are arranged in ascending order. If there is an even number of means, the median is the average of the two middle values.

The steps involved in the MoM method are:

  1. Divide the Dataset: We begin by dividing the entire dataset into $K$ groups, each containing $M$ samples. This division is such that the total number of samples $N$ equals $K \cdot M$. The choice of $K$ can depend on the size of the data and the level of precision desired.

  2. Calculate the Mean of Each Subset: For each group $k$, we calculate the mean $\hat{\mu}_k$ of the samples in that group. This is done by summing all the values in the group and dividing by the number of values $M$. Mathematically, this is represented as:

    $$\hat{\mu}_k = \frac{1}{M}\sum_{i=1}^{M}x_{ki}$$

  3. Calculate the Median of the Means: After calculating the mean of each group, we have a new set of data: the means $\hat{\mu}_1, \dots, \hat{\mu}_K$. The Median of Means estimator $\tilde{\mu}$ is then the median of these means, computed exactly as described in the definition above. A minimal implementation of these three steps is sketched below.
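The following Python sketch puts the three steps together (the choice of $K$ and the sample data are hypothetical; samples that do not fill a complete group are simply discarded):

```python
import numpy as np

def median_of_means(samples, K):
    """Median of Means: split into K groups, average each group,
    and return the median of the K group means."""
    samples = np.asarray(samples, dtype=float)
    M = len(samples) // K                    # step 1: samples per group
    grouped = samples[:K * M].reshape(K, M)  # drop leftovers, form K x M groups
    group_means = grouped.mean(axis=1)       # step 2: mean of each group
    return np.median(group_means)            # step 3: median of the means

# Example: a sample of 1000 points contaminated by one extreme outlier.
rng = np.random.default_rng(2)
x = np.append(rng.normal(0.0, 1.0, 999), 1e6)
print(x.mean())                   # ruined by the outlier (roughly 1000)
print(median_of_means(x, K=10))   # stays close to the true mean 0
```

Only the single group containing the outlier is corrupted; the median simply ignores that one inflated group mean.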

Advantages of the Median of Means Method

The MoM method offers several advantages over traditional measures of central tendency, such as the mean or median alone. The MoM method is less sensitive to outliers or extreme values, which can significantly distort the mean. By calculating the mean of several subsets, the impact of any outliers is reduced. This is formally stated as follows:

Proposition 2 (Convergence of Median of Means) Let $(X_i)_{i=1}^N$ be a sequence of independent and identically distributed random variables drawn from a population with a true mean $\mu$ and a standard deviation $\sigma$, and let $\tilde{\mu}$ be the Median of Means estimator calculated from $K$ groups, each of size $M$. Then for any $t > 0$, with a suitable choice of $K$ (proportional to $t^2$, as the proof below makes precise), the probability that the absolute difference between $\tilde{\mu}$ and the true mean $\mu$ is greater than $\frac{\sigma}{\sqrt{N}}t$ is at most $2\exp(-ct^2)$ for some constant $c > 0$; namely,

$$P\left\{\left| \tilde{\mu} - \mu \right| \ge \frac{\sigma}{\sqrt{N}}t\right\} = O\left(\exp(-ct^2)\right)$$

Now, let’s prove this.

We begin the proof by applying Chebyshev’s inequality.

Step 1: Using Chebyshev’s Inequality

Chebyshev’s inequality states that for any random variable with a finite expected value $\mu$ and finite non-zero variance $\sigma^2$, the probability that the random variable deviates from its mean by more than $t$ is at most $\frac{\sigma^2}{t^2}$.

In our case, we have a random sample $x_1, \dots, x_N$ drawn from a distribution with mean $\mu$ and variance $\sigma^2$. We divide these samples into $K$ groups, each containing $M$ samples. For each group $k$, we calculate the mean of the samples in that group, denoted as $\hat{\mu}_k$.

The variance of each $\hat{\mu}_k$ is $\sigma^2/M$, because the variance of the mean of $M$ independent samples from a distribution with variance $\sigma^2$ is $\sigma^2/M$.

Using Chebyshev’s inequality (the one-sided event below is contained in the corresponding two-sided deviation event), we have:

$$P\left\{\hat{\mu}_k - \mu \ge \frac{t\sigma}{\sqrt{N}}\right\} \le \frac{\sigma^2/M}{(t\sigma/\sqrt{N})^2}$$

Simplifying the right-hand side, we get:

$$P\left\{\hat{\mu}_k - \mu \ge \frac{t\sigma}{\sqrt{N}}\right\} \le \frac{N}{Mt^2} = \frac{K}{t^2}$$

Step 2: Using Median Property

Now we want to show that if $\tilde{\mu} \ge \mu + t\sigma/\sqrt{N}$, then at least half of the $\hat{\mu}_k$’s are $\ge \mu + t\sigma/\sqrt{N}$.

This is a property of the median: if the median of a set of numbers is at least some value, then at least half of the numbers in the set are at least that value.

Step 3: Defining Bernoulli Random Variables

Let’s define $X_k$ as a Bernoulli random variable, where:

  • $X_k = 1$ if $\hat{\mu}_k \ge \mu + \frac{t\sigma}{\sqrt{N}}$
  • $X_k = 0$ otherwise

We have $K$ such variables, $X_1, X_2, \dots, X_K$, each corresponding to one group; they are independent because the groups are disjoint.

Step 4: Bounding the Expectation of the Bernoulli Variables

From Step 1, we know that $P\{X_k = 1\} \le \frac{K}{t^2}$. Since $X_k$ is a Bernoulli random variable, its expectation equals this probability, so $\mathbb{E}[X_k] = P\{X_k = 1\} \le \frac{K}{t^2}$.

Step 5: Using Hoeffding’s Inequality

We aim to bound the probability that the sum of the $X_k$’s is at least $K/2$, which is equivalent to saying that the average of the $X_k$’s is at least $1/2$.

We can use Hoeffding’s inequality here, which states that for independent random variables taking values in $[0,1]$ with common mean $\mathbb{E}[X_k]$, and for any $a \ge \mathbb{E}[X_k]$,

$$P\left\{\frac{1}{K}\sum_{k=1}^{K}X_k \ge a\right\} \le \exp\left(-2K\left(a - \mathbb{E}[X_k]\right)^2\right)$$

Substituting the upper bound $\mathbb{E}[X_k] \le \frac{K}{t^2}$ and taking $a = 1/2$ (which requires $\frac{K}{t^2} < \frac{1}{2}$, i.e., $t^2 > 2K$), we get:

$$P\left\{\frac{1}{K}\sum_{k=1}^{K}X_k \ge \frac{1}{2}\right\} \le \exp\left(-2K\left(\frac{1}{2} - \frac{K}{t^2}\right)^2\right)$$

Simplifying, we get:

$$P\left\{\frac{1}{K}\sum_{k=1}^{K}X_k \ge \frac{1}{2}\right\} \le \exp\left(-\frac{K(t^2 - 2K)^2}{2t^4}\right)$$

The exponent is negative precisely when $t^2 > 2K$, as assumed above. If we now choose the number of groups to grow with $t^2$, say $K = t^2/4$, then $\frac{K}{t^2} = \frac{1}{4}$ and the exponent becomes $-2K\left(\frac{1}{2} - \frac{1}{4}\right)^2 = -\frac{K}{8} = -\frac{t^2}{32}$; that is, the bound takes the form $\exp(-ct^2)$ with $c = \frac{1}{32}$.
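As a concrete check of this choice (using the hypothetical values $t = 20$ and $K = t^2/4 = 100$), the bound evaluates to

$$P\left\{\frac{1}{K}\sum_{k=1}^{K}X_k \ge \frac{1}{2}\right\} \le \exp\left(-2 \cdot 100 \cdot \left(\frac{1}{2} - \frac{1}{4}\right)^2\right) = \exp(-12.5) \approx 3.7 \times 10^{-6},$$

which is far smaller than the Chebyshev bound $\frac{1}{t^2} = \frac{1}{400} = 2.5 \times 10^{-3}$ at the same deviation level.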

In Big O notation, this tail bound can be represented as:

$$P\left\{\frac{1}{K}\sum_{k=1}^{K}X_k \ge \frac{1}{2}\right\} = O\left(\exp(-ct^2)\right)$$

This signifies that the probability decreases exponentially as $t^2$ increases, where $c$ is a constant that depends on how the number of groups $K$ is chosen relative to $t$. By the median property from Step 2, this bounds the upper tail $P\{\tilde{\mu} \ge \mu + t\sigma/\sqrt{N}\}$; an identical argument applied to the lower tail $P\{\tilde{\mu} \le \mu - t\sigma/\sqrt{N}\}$ yields the same bound, and combining the two gives the factor of $2$ in Proposition 2.

Upon comparing the results from Proposition 1 ($P\left( \left| \hat{\mu} - \mu \right| > \frac{\sigma}{\sqrt{N}}t \right) \leq \frac{1}{t^2}$) and Proposition 2 ($P\left\{\left| \tilde{\mu} - \mu \right| \ge \frac{\sigma}{\sqrt{N}}t\right\} = O\left(\exp(-ct^2)\right)$), we can discern that the tail bounds differ significantly. The classical mean estimator’s deviation probability decays at the rate $O(\frac{1}{t^2})$. In contrast, the Median of Means estimator’s decays at the rate $O\left(\exp(-ct^2)\right)$, which is markedly faster for large $t$.
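To see this difference in practice, here is a minimal Python sketch (the population, a heavy-tailed Lomax/Pareto-II distribution with tail index $a = 2.1$, and the values of $N$, $K$, and the trial count are all hypothetical choices) comparing the worst-case error of the two estimators over repeated trials:

```python
import numpy as np

rng = np.random.default_rng(3)
a, N, K, trials = 2.1, 10_000, 20, 2_000
true_mean = 1 / (a - 1)  # mean of the Lomax(a) distribution for a > 1

def median_of_means(samples, K):
    """Split into K groups, average each group, take the median of the means."""
    M = len(samples) // K
    return np.median(samples[:K * M].reshape(K, M).mean(axis=1))

errors_mean, errors_mom = [], []
for _ in range(trials):
    x = rng.pareto(a, N)  # heavy-tailed sample with finite variance
    errors_mean.append(abs(x.mean() - true_mean))
    errors_mom.append(abs(median_of_means(x, K) - true_mean))

# MoM's worst-case error is typically noticeably smaller than the mean's,
# because a single extreme draw can only corrupt one group.
print(f"classical mean:  worst error = {max(errors_mean):.3f}")
print(f"median of means: worst error = {max(errors_mom):.3f}")
```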

In conclusion, the Median of Means method is a powerful tool in statistical analysis, providing a robust and accurate measure of central tendency, especially in large datasets with outliers. By understanding and implementing this method, you can extract deeper insights from your data and make more informed decisions.

Reference

[1] Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.