Problem 11
Question
Consider the dataset of measured chest circumferences of 5732 Scottish soldiers (see Exercises 15.1, 17.6, and 18.9). The Kolmogorov-Smirnov distance between the empirical distribution function and the distribution function \(F_{\bar{x}_{n}, s_{n}}\) of the normal distribution with estimated parameters \(\hat{\mu}=\bar{x}_{n}=39.85\) and \(\hat{\sigma}=s_{n}=2.09\) is equal to $$ t_{\mathrm{ks}}=\sup _{a \in \mathbb{R}}\left|F_{n}(a)-F_{\bar{x}_{n}, s_{n}}(a)\right|=0.0987, $$ where \(\bar{x}_{n}\) and \(s_{n}\) denote sample mean and sample standard deviation of the dataset. Suppose we want to perform a bootstrap simulation with one thousand repetitions for the KS distance to investigate to which degree the value \(0.0987\) agrees with the assumed normality of the dataset. Describe the appropriate bootstrap simulation that must be carried out.
Step-by-Step Solution
Key Concepts
Kolmogorov-Smirnov test
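The Kolmogorov-Smirnov (KS) distance is the largest absolute difference between the empirical distribution function of the sample and the distribution function of the model it is compared with; it is also called the KS statistic. Here the KS statistic is 0.0987, so the two functions never differ by more than 0.0987. On its own this number is hard to interpret: it has to be compared against critical values or against the values the statistic takes in a bootstrap simulation. Because \(\mu\) and \(\sigma\) are estimated from the same data, the standard KS critical values for a fully specified distribution do not apply directly, which is one reason to use a bootstrap simulation here.
The bootstrap simulation shows how the KS statistic behaves when the data really do come from a normal distribution, so that we can judge whether the observed value 0.0987 is consistent with the assumption of normality.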
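One way to write down a single bootstrap repetition, assuming a parametric bootstrap from the fitted normal distribution (the starred symbols are notation introduced here for illustration, not taken from the problem statement): generate a bootstrap dataset \(x_{1}^{*}, \ldots, x_{5732}^{*}\) from the \(N(39.85, 2.09^{2})\) distribution, compute its sample mean \(\bar{x}_{n}^{*}\), its sample standard deviation \(s_{n}^{*}\), and its empirical distribution function \(F_{n}^{*}\), and then compute $$ t_{\mathrm{ks}}^{*}=\sup _{a \in \mathbb{R}}\left|F_{n}^{*}(a)-F_{\bar{x}_{n}^{*}, s_{n}^{*}}(a)\right| . $$ Repeating this one thousand times gives one thousand values \(t_{\mathrm{ks}}^{*}\), against which the observed value \(0.0987\) can be compared.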
Normal distribution
A normal distribution is determined by two parameters, the mean \(\mu\) and the standard deviation \(\sigma\). The mean is the "center" of the distribution, and the standard deviation measures its spread or "width". A key property of the normal distribution is symmetry around the mean: values are distributed evenly on both sides of the mean, with a predictable pattern in which about 68% of values fall within one standard deviation of the mean.
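The 68% figure can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist` with the fitted parameters from the problem statement; the helper name `prob_within` is made up for this illustration.

```python
from statistics import NormalDist

# Fitted parameters quoted in the problem statement.
mu_hat, sigma_hat = 39.85, 2.09
fitted = NormalDist(mu=mu_hat, sigma=sigma_hat)

def prob_within(k):
    """Probability that a value from the fitted normal distribution
    lies within k standard deviations of the mean (hypothetical helper)."""
    return fitted.cdf(mu_hat + k * sigma_hat) - fitted.cdf(mu_hat - k * sigma_hat)

within_one = prob_within(1)
within_two = prob_within(2)
print(round(within_one, 4), round(within_two, 4))  # prints 0.6827 0.9545
```

The result does not depend on the particular values of the mean and standard deviation; any normal distribution gives the same proportions.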
When performing statistical tests like the KS test, the normal distribution provides a model to compare against the empirical distribution of the sample data. Determining whether the empirical data differ significantly from this model helps to assess assumptions about the underlying population distribution.
Empirical distribution function
The empirical distribution function (EDF) \(F_{n}\) of a sample gives, for every real number \(a\), the proportion of sample values less than or equal to \(a\). Its graph is a step function that jumps by \(1/n\) at each data point (by \(k/n\) where \(k\) observations share the same value), where \(n\) is the number of observations.
In the KS test, the EDF of the sample is compared against the CDF of a theoretical distribution, such as the normal distribution, to see if there are any significant discrepancies. Large discrepancies indicate differences between the sample distribution and the theoretical model. Understanding the EDF is crucial in performing and interpreting results from the KS test, as it forms the basis of the observed differences.
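The comparison between the EDF and a model distribution function can be sketched in a few lines of Python. This is an illustrative implementation, not the method used to obtain the value quoted in the problem; the function names `edf` and `ks_distance` and the three-point sample are assumptions made for the example. It relies on the fact that, for a continuous model, the supremum is attained at a data point, either at the top or just below the bottom of a jump of the EDF.

```python
from statistics import NormalDist

def edf(sample, a):
    """Empirical distribution function: proportion of sample values <= a."""
    return sum(x <= a for x in sample) / len(sample)

def ks_distance(sample, model):
    """Largest absolute difference between the EDF of `sample` and the
    distribution function of `model` (any object with a .cdf method)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = model.cdf(x)
        # (i + 1) / n is the EDF at x, i / n is its value just left of x
        # (for tied values the extreme cases within the tie give the maximum).
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Made-up toy sample, compared with a normal distribution N(2, 1).
toy = [1.0, 2.0, 3.0]
toy_distance = ks_distance(toy, NormalDist(mu=2.0, sigma=1.0))
```

For the toy sample the largest gap occurs at the first data point, where the EDF jumps to 1/3 while the model distribution function is still about 0.159.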
Sample mean and standard deviation
The sample mean \(\bar{x}_{n}\) is an estimate of the population mean \(\mu\), calculated by averaging all the sample observations. It describes the "central tendency" of the data, the value around which the observations are centered.
The sample standard deviation \(s_{n}\) measures the "spread" of the data around the mean. It reflects how much the individual data points deviate from the mean, and it serves as an estimate of the population standard deviation \(\sigma\).
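As a small illustration, both quantities can be computed with Python's standard library. The five measurements below are made up for the example and are not taken from the soldiers dataset; `statistics.stdev` uses the divisor \(n-1\).

```python
from statistics import mean, stdev

# Made-up chest measurements in inches (illustration only).
chests = [38.0, 39.0, 40.0, 41.0, 42.0]

x_bar = mean(chests)   # sample mean
s = stdev(chests)      # sample standard deviation, divisor n - 1

print(x_bar, round(s, 4))  # prints 40.0 1.5811
```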
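In the bootstrap simulation, the sample mean and sample standard deviation are recalculated for each bootstrap dataset, and the KS distance is computed against the normal distribution with these re-estimated parameters, exactly as was done for the original data. The resulting KS distances show how much the statistic varies across simulated samples from a normal distribution, which is the yardstick against which the observed value is judged.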
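The whole simulation can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes a parametric bootstrap that draws each bootstrap dataset from the fitted normal distribution, and the function names `ks_distance`, `bootstrap_ks`, and `exceedance_fraction` are hypothetical names chosen for this example. Only the standard library is used.

```python
import math
import random
from statistics import NormalDist

def ks_distance(sample, model):
    """Largest absolute difference between the EDF of `sample` and model.cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = model.cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def bootstrap_ks(n, mu_hat, sigma_hat, reps, seed=None):
    """Simulate `reps` bootstrap values of the KS distance.

    Each repetition draws n values from N(mu_hat, sigma_hat^2), re-estimates
    the mean and standard deviation from that bootstrap dataset, and measures
    the KS distance to the normal distribution with the re-estimated parameters.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        boot = [rng.gauss(mu_hat, sigma_hat) for _ in range(n)]
        m = sum(boot) / n
        s = math.sqrt(sum((x - m) ** 2 for x in boot) / (n - 1))
        values.append(ks_distance(boot, NormalDist(mu=m, sigma=s)))
    return values

def exceedance_fraction(values, t_obs):
    """Fraction of bootstrap KS distances at least as large as t_obs."""
    return sum(v >= t_obs for v in values) / len(values)
```

For the setting of the problem one would call `bootstrap_ks(5732, 39.85, 2.09, 1000)` and then `exceedance_fraction(values, 0.0987)` on the result; in pure Python the full-size run may take a while. A small fraction would indicate that a KS distance as large as 0.0987 is unusual for normally distributed data of this size.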