Statistics is a field that empowers us to transform raw data into insightful information. Among the most commonly used techniques in statistics is the calculation of the mean, or the average. However, when we aim to minimize the impact of outliers or extreme values, we can employ the Median of Means method. This statistical approach provides a more robust estimate of a dataset’s central tendency. Let’s delve deeper into understanding this method.
Classical Mean Estimation
Before we explore the Median of Means method, it’s crucial to understand the classical approach to estimating the mean of a distribution. The mean, a fundamental concept in statistics, represents the central or ‘average’ value of a dataset. The classical mean estimator, often symbolized as $\hat{\mu}$, is used to estimate the mean of a distribution.
Definition (Estimator $\hat{\mu}$) Given a random sample $X_1, X_2, \dots, X_n$ drawn from a distribution with an unknown mean $\mu$ and a known standard deviation $\sigma$, the estimator of the mean, $\hat{\mu}$, is calculated as the sum of the sampled values divided by the number of samples. This can be expressed mathematically as:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

where $n$ is the total number of samples and $X_i$ represents each individual sample from the distribution.
This formula provides us with the arithmetic mean of the sampled values, which serves as our best estimate of the true mean of the distribution.
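As a concrete illustration, the estimator above takes only a few lines of NumPy (a minimal sketch; the function name `classical_mean` is our own):

```python
import numpy as np

def classical_mean(samples):
    """Classical mean estimator: the sum of the sampled values
    divided by the number of samples."""
    samples = np.asarray(samples, dtype=float)
    return samples.sum() / samples.size

# Example: estimate the mean of a N(3, 1) distribution from 10,000 draws.
rng = np.random.default_rng(0)
draws = rng.normal(loc=3.0, scale=1.0, size=10_000)
estimate = classical_mean(draws)  # close to the true mean 3.0
```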
The next question that arises is: how effective is this estimator? The following bound shows that the estimation error shrinks at the rate $O(1/\sqrt{n})$.
Proposition 1 (Convergence of Classical Mean Estimation) Let $X_1, X_2, \dots, X_n$ be a sequence of independent and identically distributed random variables drawn from a population with a true mean $\mu$ and a standard deviation $\sigma$. Let $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean. For any $\varepsilon > 0$, the probability that the absolute difference between the sample mean and the true mean exceeds $\varepsilon$ is bounded as

$$P(|\hat{\mu} - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}.$$
This proposition’s proof is a direct application of Chebyshev’s inequality to the random variable $\hat{\mu}$, which has mean $\mu$ and variance $\sigma^2/n$.
The terms we’ll be using are defined as follows:
- $\mu$ is the true mean of the population.
- $\hat{\mu}$ is the sample mean, given by $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$.
- $\sigma$ is the standard deviation of the population.
- $\varepsilon$ is a value greater than 0.
- $n$ is the number of samples.
Chebyshev’s inequality states that for any random variable $X$ with a finite mean $\mu_X$ and variance $\sigma_X^2$, the probability that the random variable deviates from its mean by more than $t$ standard deviations is at most $1/t^2$. Formally, it is stated as $P(|X - \mu_X| \ge t\sigma_X) \le \frac{1}{t^2}$.
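The inequality can be verified numerically (a quick sketch; we use an exponential distribution, which has mean 1 and variance 1, so $\mu_X = \sigma_X = 1$):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)  # mean 1, variance 1
mu_x, sigma_x = 1.0, 1.0

# Empirical tail probability vs. the Chebyshev bound 1/t^2.
results = []
for t in (2.0, 3.0, 4.0):
    empirical = np.mean(np.abs(x - mu_x) >= t * sigma_x)
    results.append((t, empirical, 1.0 / t**2))
    # each empirical tail probability sits below its Chebyshev bound
```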
We aim to estimate the mean of a distribution using $n$ samples and provide an upper bound for the probability that the absolute difference between the estimated mean and the true mean is greater than $\varepsilon$.
First, we define a new random variable $Y = \hat{\mu} - \mu$, which is the difference between the sample mean and the true mean. The mean of $Y$ is $E[Y] = E[\hat{\mu}] - \mu = \mu - \mu = 0$.
The variance of $Y$ is $\mathrm{Var}(Y) = \mathrm{Var}(\hat{\mu}) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{\sigma^2}{n}$, where we used the fact that the variance of the sum of independent random variables is the sum of their variances.
Applying Chebyshev’s inequality to $Y$ (with threshold $\varepsilon$, i.e., $t\sigma_Y = \varepsilon$):

$$P(|Y| \ge \varepsilon) \le \frac{\mathrm{Var}(Y)}{\varepsilon^2} = \frac{\sigma^2/n}{\varepsilon^2}$$

This is equivalent to:

$$P(|\hat{\mu} - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}$$

So, we have shown that the probability that the absolute difference between the estimated mean and the true mean is greater than $\varepsilon$ is at most $\frac{\sigma^2}{n\varepsilon^2}$, as required.
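The bound just derived can be checked by simulation (a sketch; the parameter values here are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 20_000
mu, sigma, eps = 0.0, 1.0, 0.25

# Draw `trials` independent samples of size n and record each sample mean.
sample_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)

empirical = np.mean(np.abs(sample_means - mu) >= eps)
bound = sigma**2 / (n * eps**2)  # sigma^2 / (n eps^2) = 0.16
# `empirical` comes out well below `bound`: Chebyshev is loose but valid.
```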
Understanding the Median of Means Method
The Median of Means (MoM) method is a statistical technique that merges the principles of both the mean and the median to yield a more robust measure of central tendency. It is particularly beneficial when dealing with large datasets that may contain outliers or extreme values that can distort the mean.
The MoM method operates by dividing the dataset into several smaller subsets, calculating the mean of each subset, and then determining the median of these means. This approach can offer a more accurate representation of the central tendency of the data, especially in datasets with extreme values.
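Before formalizing the procedure, a small simulation illustrates the point (a sketch; the group count and outlier values are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)

# 1,000 well-behaved samples around the true mean 0, plus 10 huge outliers.
clean = rng.normal(0.0, 1.0, size=1_000)
outliers = np.full(10, 1_000.0)
data = rng.permutation(np.concatenate([clean, outliers]))

plain_mean = data.mean()  # dragged to roughly 10 by the outliers

# Median of Means with k = 30 groups: at most 10 groups can be
# contaminated, so the median lands on a clean group's mean.
k = 30
m = data.size // k
group_means = data[: k * m].reshape(k, m).mean(axis=1)
mom = np.median(group_means)  # stays close to the true mean 0
```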
Calculating the Median of Means
As we’ve discussed, the Median of Means (MoM) method is a potent statistical technique that provides a robust measure of central tendency, especially when dealing with large datasets or those with outliers. Let’s delve into a more rigorous definition of the MoM method.
In this scenario, we have a total of $n$ samples that we’ll divide into $k$ groups. Each group contains $m$ samples, such that $n = km$.
Definition (Median of Means Estimator) Given a random sample $X_1, X_2, \dots, X_n$ drawn from a distribution, we first divide these samples into $k$ groups, each containing $m$ samples. For each group $j$, we calculate the mean of the samples in that group, denoted as $\hat{\mu}_j$. Mathematically, this can be expressed as:

$$\hat{\mu}_j = \frac{1}{m} \sum_{i=(j-1)m+1}^{jm} X_i$$

where $m$ is the number of samples in each group and $X_i$ represents each individual sample in group $j$. The Median of Means estimator, denoted as $\hat{\mu}_{\mathrm{MoM}}$, is then calculated as the median of these means:

$$\hat{\mu}_{\mathrm{MoM}} = \mathrm{median}(\hat{\mu}_1, \hat{\mu}_2, \dots, \hat{\mu}_k)$$
The median is the middle value when the means are arranged in ascending order. If there is an even number of means, the median is the average of the two middle values.
The steps involved in the MoM method are:
- Divide the Dataset: We begin by dividing the entire dataset into $k$ groups, each containing $m$ samples, such that the total number of samples equals $n = km$. The choice of $k$ can depend on the size of the data and the level of precision desired.

- Calculate the Mean of Each Subset: For each group $j$, we calculate the mean $\hat{\mu}_j$ of the samples in that group. This is done by summing all the values in the group and dividing by the number of values $m$. Mathematically, this is represented as:

$$\hat{\mu}_j = \frac{1}{m} \sum_{i=(j-1)m+1}^{jm} X_i$$

- Calculate the Median of the Means: After calculating the mean of each group, we have a new set of data: the means $\hat{\mu}_1, \hat{\mu}_2, \dots, \hat{\mu}_k$. We then calculate the Median of Means estimator, denoted as $\hat{\mu}_{\mathrm{MoM}}$, which is the median of these means. The median is the middle value when the means are arranged in ascending order; if there is an even number of means, the median is the average of the two middle values.
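The three steps above can be sketched directly in NumPy (the function name `median_of_means` is ours; when $k$ does not divide $n$, leftover samples are simply dropped):

```python
import numpy as np

def median_of_means(samples, k):
    """Median of Means: split into k groups, average each group,
    and return the median of those group means."""
    samples = np.asarray(samples, dtype=float)
    m = samples.size // k                    # Step 1: group size
    groups = samples[: k * m].reshape(k, m)  # Step 1: k groups of m samples
    group_means = groups.mean(axis=1)        # Step 2: mean of each group
    return np.median(group_means)            # Step 3: median of the means

# Example: 12 samples 0..11, k = 3 groups of m = 4.
# Group means are 1.5, 5.5, 9.5; their median is 5.5.
result = median_of_means(np.arange(12), k=3)
```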
Advantages of the Median of Means Method
The MoM method offers several advantages over traditional measures of central tendency, such as the mean or median alone. The MoM method is less sensitive to outliers or extreme values, which can significantly distort the mean. By calculating the mean of several subsets, the impact of any outliers is reduced. This is formally stated as follows:
Proposition 2 (Convergence of Median of Means) Let $X_1, X_2, \dots, X_n$ be a sequence of independent and identically distributed random variables drawn from a population with a true mean $\mu$ and a standard deviation $\sigma$, and let $\hat{\mu}_{\mathrm{MoM}}$ be the Median of Means estimator calculated from $k$ groups each of size $m$. Then for any $\varepsilon \ge 2\sigma/\sqrt{m}$, the probability that the absolute difference between the Median of Means estimator and the true mean is greater than $\varepsilon$ is at most $e^{-ck}$, where $c > 0$ is a constant; namely

$$P(|\hat{\mu}_{\mathrm{MoM}} - \mu| \ge \varepsilon) \le e^{-k/8}.$$
Now, let’s prove this.
We begin the proof by applying Chebyshev’s inequality.
Step 1: Using Chebyshev’s Inequality
Chebyshev’s inequality states that for any random variable $X$ with a finite expected value $\mu_X$ and finite non-zero variance $\sigma_X^2$, the probability that the random variable deviates from its mean by more than $t$ standard deviations is at most $1/t^2$.
In our case, we have a random sample $X_1, X_2, \dots, X_n$ drawn from a distribution with mean $\mu$ and variance $\sigma^2$. We divide these samples into $k$ groups, each containing $m$ samples. For each group $j$, we calculate the mean of the samples in that group, denoted as $\hat{\mu}_j$.
The variance of each $\hat{\mu}_j$ is $\sigma^2/m$, because the variance of the mean of $m$ independent samples from a distribution with variance $\sigma^2$ is $\sigma^2/m$.
Using Chebyshev’s inequality, we have:

$$P(|\hat{\mu}_j - \mu| \ge \varepsilon) \le \frac{\sigma^2/m}{\varepsilon^2}$$

Simplifying the right-hand side, we get:

$$P(|\hat{\mu}_j - \mu| \ge \varepsilon) \le \frac{\sigma^2}{m\varepsilon^2}$$
Step 2: Using Median Property
Now we want to show that if $|\hat{\mu}_{\mathrm{MoM}} - \mu| \ge \varepsilon$, then at least half of the $\hat{\mu}_j$’s satisfy $|\hat{\mu}_j - \mu| \ge \varepsilon$.
This is a property of the median: if the median of a set of numbers is at least some value, then at least half of the numbers in the set are at least that value. Applied here, $\hat{\mu}_{\mathrm{MoM}} \ge \mu + \varepsilon$ forces at least half of the $\hat{\mu}_j$’s to be at least $\mu + \varepsilon$, and symmetrically $\hat{\mu}_{\mathrm{MoM}} \le \mu - \varepsilon$ forces at least half to be at most $\mu - \varepsilon$; in either case at least half of the $\hat{\mu}_j$’s deviate from $\mu$ by at least $\varepsilon$.
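This step can be sanity-checked numerically (a toy example with hypothetical group means of our own choosing):

```python
import numpy as np

mu, eps = 0.0, 0.5
group_means = np.array([-0.8, 0.1, 0.7, 0.9, 1.2])  # hypothetical values
mom = np.median(group_means)  # the middle sorted value, 0.7

# If the MoM estimate deviates from mu by at least eps, then at least
# half of the group means must also deviate by at least eps.
if abs(mom - mu) >= eps:
    deviating = np.sum(np.abs(group_means - mu) >= eps)
    assert deviating >= group_means.size / 2
```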
Step 3: Defining Bernoulli Random Variables
Let’s define $Z_j$ as a Bernoulli random variable, where:
- $Z_j = 1$ if $|\hat{\mu}_j - \mu| \ge \varepsilon$
- $Z_j = 0$ otherwise
We have $k$ such variables, $Z_1, Z_2, \dots, Z_k$, each corresponding to one group.
Step 4: Bounding the Expectation of the Bernoulli Variables
From Step 1, we know that the probability of $|\hat{\mu}_j - \mu| \ge \varepsilon$ is at most $\frac{\sigma^2}{m\varepsilon^2}$, i.e., $P(Z_j = 1) \le \frac{\sigma^2}{m\varepsilon^2} =: p$. This is also the expectation of each Bernoulli random variable: $E[Z_j] = P(Z_j = 1) \le p$.
Step 5: Using Hoeffding’s Inequality
We aim to bound the probability that the sum of the $Z_j$’s is at least $k/2$, which is equivalent to saying that the average of the $Z_j$’s is at least $1/2$.
We can use Hoeffding’s inequality here, which states that for independent random variables taking values in $[0, 1]$ and any $t > 0$,

$$P\left(\frac{1}{k}\sum_{j=1}^{k} Z_j - E[Z_1] \ge t\right) \le e^{-2kt^2}$$

Substituting $t = \frac{1}{2} - p$ (valid when $p < \frac{1}{2}$) and using $E[Z_1] \le p$, we get:

$$P\left(\frac{1}{k}\sum_{j=1}^{k} Z_j \ge \frac{1}{2}\right) \le e^{-2k\left(\frac{1}{2} - p\right)^2}$$

Combining this with Step 2, we get:

$$P(|\hat{\mu}_{\mathrm{MoM}} - \mu| \ge \varepsilon) \le e^{-2k\left(\frac{1}{2} - p\right)^2}$$
Assuming $p \le \frac{1}{4}$ (which holds when $\varepsilon \ge 2\sigma/\sqrt{m}$, and which ensures that the exponent is negative), we have $\left(\frac{1}{2} - p\right)^2 \ge \frac{1}{16}$, so the bound simplifies to $e^{-k/8}$, which is of the form $e^{-ck}$ for the constant $c = \frac{1}{8}$.
In asymptotic notation, this tail bound can be represented as:

$$P(|\hat{\mu}_{\mathrm{MoM}} - \mu| \ge \varepsilon) \le e^{-\Omega(k)}$$

This signifies that the probability decreases exponentially as $k$ increases, where the constant $c$ in $e^{-ck}$ depends on $p$, and hence on the distribution and the group size.
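The exponential decay can be observed in simulation (a sketch; we choose $\varepsilon = 2\sigma/\sqrt{m}$, which by Chebyshev makes each group mean deviate with probability at most $1/4$):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.0, 1.0
k, m = 40, 25                  # 40 groups of 25 samples each
eps = 2 * sigma / np.sqrt(m)   # each group mean deviates with prob <= 1/4
trials = 5_000

# One MoM estimate per trial: compute group means, then their median.
data = rng.normal(mu, sigma, size=(trials, k, m))
group_means = data.mean(axis=2)
mom = np.median(group_means, axis=1)

empirical = np.mean(np.abs(mom - mu) >= eps)
bound = np.exp(-k / 8)  # e^{-k/8}, under 0.01 for k = 40
# `empirical` is essentially zero here, comfortably below the bound.
```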
Upon comparing the results from Proposition 1 ($P(|\hat{\mu} - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}$) and Proposition 2 ($P(|\hat{\mu}_{\mathrm{MoM}} - \mu| \ge \varepsilon) \le e^{-ck}$), we can discern that the convergence rates differ significantly. The classical mean estimator’s tail probability decays only polynomially, at rate $O(1/n)$, whereas the Median of Means estimator’s tail probability decays exponentially in the number of groups $k$, which is markedly faster.
In conclusion, the Median of Means method is a powerful tool in statistical analysis, providing a robust and accurate measure of central tendency, especially in large datasets with outliers. By understanding and implementing this method, you can extract deeper insights from your data and make more informed decisions.
Reference
[1] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.