Problem 10

Question

A method to investigate the sensitivity of the sample mean and the sample median to extreme outliers is to replace one or more elements in a given dataset by a number \(y\) and investigate the effect when \(y\) goes to infinity. To illustrate this, consider the dataset from Quick Exercise 16.1:
$$ \begin{array}{lllll} 4.6 & 3.0 & 3.2 & 4.2 & 5.0 \end{array} $$
with sample mean 4 and sample median \(4.2\).
a. We replace the element \(3.2\) by some real number \(y\). What happens with the sample mean and the sample median of this new dataset as \(y \rightarrow \infty\)?
b. We replace a number of elements by some real number \(y\). How many elements do we need to replace so that the sample median of the new dataset goes to infinity as \(y \rightarrow \infty\)?
c. Suppose we have another dataset of size \(n\). How many elements do we need to replace by some real number \(y\), so that the sample mean of the new dataset goes to infinity as \(y \rightarrow \infty\)? And how many elements do we need to replace, so that the sample median of the new dataset goes to infinity?

Step-by-Step Solution

Answer
a. The mean goes to infinity; the median changes from \(4.2\) to \(4.6\) and stays there. b. Replace at least 3 elements. c. Replace 1 element for the mean, \(\lceil n/2 \rceil\) elements for the median.
Step 1: Analyzing the sample mean effect
The original dataset is \( \{4.6, 3.0, 3.2, 4.2, 5.0\} \), with sample mean \( \bar{x} = \frac{4.6 + 3.0 + 3.2 + 4.2 + 5.0}{5} = \frac{20.0}{5} = 4 \). If we replace \(3.2\) with \(y\), the new dataset becomes \( \{4.6, 3.0, y, 4.2, 5.0\} \), and the new sample mean is \( \bar{x}_y = \frac{4.6+3.0+y+4.2+5.0}{5} = \frac{16.8+y}{5} \). As \( y \rightarrow \infty \), \( \bar{x}_y \rightarrow \infty \), because the constant \(16.8\) is fixed while the term \(y/5\) grows without bound.
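The growth of the mean can also be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the helper name `mean_with_replacement` and the chosen values of `y` are illustrative assumptions, not part of the exercise.

```python
from statistics import mean

# Dataset from the exercise; index 2 holds the value 3.2 that gets replaced.
data = [4.6, 3.0, 3.2, 4.2, 5.0]

def mean_with_replacement(y):
    """Sample mean after replacing 3.2 by y (hypothetical helper)."""
    modified = data.copy()
    modified[2] = y
    return mean(modified)

# The mean follows (16.8 + y) / 5, so it grows linearly in y.
for y in (10, 1_000, 1_000_000):
    print(y, mean_with_replacement(y))
```

Each printed mean is roughly \(y/5\) plus a constant, so a single replaced element is enough to push it as high as desired.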
Step 2: Analyzing the sample median effect
The original median of the dataset \( \{4.6, 3.0, 3.2, 4.2, 5.0\} \) is \( 4.2 \), the middle value of the sorted list \( \{3.0, 3.2, 4.2, 4.6, 5.0\} \). With \(3.2\) replaced by \(y\), the sorted dataset becomes \( \{3.0, 4.2, 4.6, 5.0, y\} \) (assuming \(y > 5.0\), which holds eventually as \(y\rightarrow\infty\)). The middle value is now \(4.6\). Replacing one element therefore shifts the median from \(4.2\) to \(4.6\), but the median then stays at \( 4.6 \) no matter how large \(y\) becomes, so it does not go to infinity.
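A short numerical check makes the bounded behaviour of the median visible. This is a sketch under the same setup as the exercise; the helper name `median_with_replacement` is a hypothetical name chosen for illustration.

```python
from statistics import median

# Dataset from the exercise; index 2 holds the value 3.2 that gets replaced.
data = [4.6, 3.0, 3.2, 4.2, 5.0]

def median_with_replacement(y):
    """Sample median after replacing 3.2 by y (hypothetical helper)."""
    modified = data.copy()
    modified[2] = y
    return median(modified)

# y = 3.2 reproduces the original dataset; larger y only moves the largest value.
for y in (3.2, 10, 1_000, 1_000_000):
    print(y, median_with_replacement(y))
```

The median is \(4.2\) for \(y = 3.2\) and \(4.6\) for each of the larger values of \(y\) in the loop.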
Step 3: Finding elements for median to go to infinity
For the median to go to infinity, the middle value of the sorted dataset must itself be \(y\). With five data points the median is the third smallest value, so at least three elements must be replaced. Replacing only two leaves three original values in place, and the median is then at most the largest of them. With three elements replaced, the dataset might look like \( \{y, y, y, 4.2, 5.0\} \), which sorts to \( \{4.2, 5.0, y, y, y\} \) for \(y > 5.0\). The median is then \(y\), so it goes to infinity as \(y \rightarrow \infty\).
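The threshold of three replacements can be seen by trying every count from 0 to 5. The sketch below replaces the first \(k\) entries of the list; that choice, the helper name `median_after_replacing`, and the stand-in value `y = 1e9` are assumptions made for illustration.

```python
from statistics import median

data = [4.6, 3.0, 3.2, 4.2, 5.0]

def median_after_replacing(values, k, y):
    """Median after replacing the first k entries of values by y (hypothetical helper)."""
    modified = [y] * k + list(values[k:])
    return median(modified)

y = 1e9  # stand-in for a very large outlier
for k in range(len(data) + 1):
    print(k, median_after_replacing(data, k, y))
```

For \(k \le 2\) at least three original values remain, so the median cannot exceed \(5.0\); from \(k = 3\) onward the median equals `y`.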
Step 4: General results for sample mean and median in dataset size n
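For a dataset of size \(n\), replacing a single element is enough to make the mean go to infinity, because the mean sums all values and therefore contains the term \(y/n\). For the median, at least half of the dataset must be replaced: \(\lceil \frac{n}{2} \rceil\) elements, where \(\lceil x \rceil\) denotes the ceiling function, which rounds \(x\) up to the nearest integer. For odd \(n\) this is \(\frac{n+1}{2}\), so the middle value itself becomes \(y\). For even \(n\) it is exactly \(\frac{n}{2}\), so the upper of the two middle values becomes \(y\) and their average still tends to infinity.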
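The general case can be written out as a short worked derivation. The symbols \(k\) (the number of replaced elements), \(S\) (the sum of the untouched values), and \(x_{(i)}\) (the \(i\)-th smallest value of the modified dataset) are notation introduced here for illustration; they do not appear in the exercise. Take \(y\) larger than every untouched value, so that the sorted modified dataset consists of the \(n-k\) untouched values followed by \(k\) copies of \(y\).

$$
\begin{aligned}
\text{mean:}\quad & \bar{x}_y = \frac{S + k\,y}{n} \rightarrow \infty \iff k \geq 1, \\
\text{median, } n \text{ odd:}\quad & x_{((n+1)/2)} = y \iff n - k < \tfrac{n+1}{2} \iff k \geq \tfrac{n+1}{2}, \\
\text{median, } n \text{ even:}\quad & \tfrac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) \rightarrow \infty \iff x_{(n/2+1)} = y \iff n - k < \tfrac{n}{2} + 1 \iff k \geq \tfrac{n}{2}.
\end{aligned}
$$

Both median cases combine into the single condition \(k \geq \lceil n/2 \rceil\).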

Key Concepts

Outliers in Statistics
In statistics, an outlier is a data point significantly different from other observations. This deviation can occur because of variability in the measurement or it could indicate an experimental error. Detecting outliers is crucial as they can skew and misinform data interpretations and analyses.
An outlier might be a value that's much higher or much lower than the others in a dataset, and it's essential to identify these as they can affect statistical results in profound ways. Understanding how outliers are defined and identified helps analysts in making better decisions about which data to consider for their analyses.
Effect of Outliers on Mean
The effect of outliers on the mean is substantial. The mean, or average, calculates the central value of a set by summing all observations and dividing by the count of those observations. Since each number in the dataset contributes equally to the final result, an outlier can greatly alter the mean.
For instance, consider a dataset like \( \{4.6, 3.0, 3.2, 4.2, 5.0\} \). When a single value is replaced by an extreme one, such as replacing \( 3.2 \) with a value \( y \) that tends toward infinity, the mean increases without bound. The mean is sensitive to extreme values because every observation enters the sum with the same weight \(1/n\), which makes it a non-robust measure in the presence of outliers.
Effect of Outliers on Median
In contrast to the mean, the median is more resistant to outliers. The median identifies the middle value in a sorted dataset, which makes it robust to extreme changes in a few values. When elements are replaced by an outlier \( y \rightarrow \infty \), the median stays bounded until at least half of the dataset consists of these extreme values.
For the example dataset \(\{4.6, 3.0, 3.2, 4.2, 5.0\}\), replacing just \( 3.2 \) with a very large number \( y \) moves the median only from \(4.2\) to \(4.6\), where it stays however large \(y\) becomes, because the median depends only on the middle value of the sorted data. Therefore, the median offers a more stable measure of central tendency when outliers are present.
Dataset Modification
Modifying a dataset by replacing its elements can help understand the robustness of statistical measures like the mean and median. In exercises focusing on this aspect, altering values assists in viewing the impact of changes.
For instance, if you take a dataset of size \( n \) and replace some of its elements by \(y\), the mean and the median respond differently. To make the mean go to infinity, one element is always enough, because every value enters the sum. For the median, at least half of the dataset (\(\lceil n/2 \rceil\) elements) must be replaced, since with fewer replacements the median is still determined by original values.
Tips for modifying datasets:
  • Identify which values significantly affect your statistical results.
  • Understand which statistical measure suits your analysis better.
  • Replace at least half of the values when testing whether the median can be driven to infinity.
Experiments with dataset modification provide valuable insights into the behavior and reliability of these statistical metrics.
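Such an experiment can be sketched for general \(n\) as follows. The data \(1, 2, \ldots, n\), the outlier value `y = 1e12`, and the helper name `min_replacements` are all hypothetical choices made for illustration. As a finite proxy for "goes to infinity", the sketch checks whether the statistic exceeds every original value, which can only happen when it depends on `y`.

```python
from math import ceil
from statistics import mean, median

def min_replacements(statistic, n, y=1e12):
    """Smallest k such that replacing k of n values by y lifts the
    statistic above every original value (hypothetical helper)."""
    base = [float(i) for i in range(1, n + 1)]  # illustrative data 1, 2, ..., n
    for k in range(1, n + 1):
        modified = [y] * k + base[k:]  # replace the first k entries by y
        if statistic(modified) > max(base):
            return k
    return None

# Columns: n, replacements needed for the mean, for the median, and ceil(n / 2).
for n in range(1, 11):
    print(n, min_replacements(mean, n), min_replacements(median, n), ceil(n / 2))
```

For every \(n\) in the loop, the mean column is 1 and the median column matches \(\lceil n/2 \rceil\).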