Problem 11

Question

Consider the dataset of measured chest circumferences of 5732 Scottish soldiers (see Exercises 15.1, 17.6, and 18.9). The Kolmogorov-Smirnov distance between the empirical distribution function and the distribution function $F_{\bar{x}_{n}, s_{n}}$ of the normal distribution with estimated parameters $\hat{\mu}=\bar{x}_{n}=39.85$ and $\hat{\sigma}=s_{n}=2.09$ is equal to $$ t_{\mathrm{ks}}=\sup _{a \in \mathbb{R}}\left|F_{n}(a)-F_{\bar{x}_{n}, s_{n}}(a)\right|=0.0987, $$ where $\bar{x}_{n}$ and $s_{n}$ denote sample mean and sample standard deviation of the dataset. Suppose we want to perform a bootstrap simulation with one thousand repetitions for the KS distance to investigate to which degree the value $0.0987$ agrees with the assumed normality of the dataset. Describe the appropriate bootstrap simulation that must be carried out.

Step-by-Step Solution

Answer

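Carry out a parametric bootstrap: generate 1000 datasets of size 5732 from the normal distribution with parameters 39.85 and 2.09, compute for each the KS distance between its empirical distribution function and the normal distribution function with parameters re-estimated from that dataset, and compare the resulting 1000 values with 0.0987.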

Step 1: Understand Objectives

The goal is to perform a bootstrap simulation to examine how well the Kolmogorov-Smirnov distance of 0.0987 supports the assumption that the dataset is normally distributed. Bootstrap simulations can help us understand the distribution of the KS statistic under the null hypothesis of normality.

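Step 2: Generate Bootstrap Samples

Because the simulation must reflect the assumed normality, generate the bootstrap samples from the fitted model rather than by resampling the data: draw 1000 bootstrap datasets, each of size 5732 (the size of the original dataset), from the normal distribution with parameters $\hat{\mu}=39.85$ and $\hat{\sigma}=2.09$. This is a parametric bootstrap. Resampling the observed values with replacement would instead mimic the empirical distribution of the data, whether or not that distribution is normal.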

Step 3: Calculate Estimated Parameters

For each bootstrap sample, compute the sample mean $\bar{x}_{i}$ and sample standard deviation $s_{i}$. This step provides the parameters for the theoretical normal distribution that you will compare each empirical distribution against.

Step 4: Compute KS Distance for Each Sample

For each of the 1000 bootstrap samples, calculate the Kolmogorov-Smirnov distance $t_{\mathrm{ks}, i}$: the maximum absolute difference between the empirical distribution function of the bootstrap sample and the distribution function of the normal distribution with parameters $\bar{x}_{i}$ and $s_{i}$ (not the standard normal distribution, and not the original estimates 39.85 and 2.09).
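As an illustration of Steps 2 to 4 for a single repetition, here is a minimal Python sketch using only the standard library. The function name `ks_distance_to_fitted_normal` and the fixed seed are assumptions made for this example, and the bootstrap sample is drawn from the normal distribution with the estimated parameters 39.85 and 2.09. The supremum is evaluated at the sorted data points, where the empirical distribution function jumps.

```python
import random
from statistics import NormalDist, mean, stdev


def ks_distance_to_fitted_normal(sample):
    """KS distance between the sample's EDF and the normal CDF fitted to that sample."""
    xs = sorted(sample)
    n = len(xs)
    # Parameters are re-estimated from this sample (stdev uses the n - 1 divisor).
    fitted = NormalDist(mean(xs), stdev(xs))
    distance = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The EDF jumps from (i - 1)/n to i/n at x, so check both sides of the jump.
        distance = max(distance, i / n - f, f - (i - 1) / n)
    return distance


rng = random.Random(0)  # fixed seed, only so the sketch is reproducible
boot_sample = [rng.gauss(39.85, 2.09) for _ in range(5732)]  # one bootstrap dataset
t_star = ks_distance_to_fitted_normal(boot_sample)  # one bootstrap KS distance
```

Repeating the last two lines 1000 times, each time with a fresh sample, yields the bootstrap values $t_{\mathrm{ks}, 1}, \ldots, t_{\mathrm{ks}, 1000}$.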

Step 5: Analyze the Bootstrap Distribution

After obtaining 1000 KS distances from the bootstrap samples, analyze their distribution, for example by plotting a histogram or computing summary statistics such as the mean and standard deviation of the KS distances.

Step 6: Compare with Observed KS Distance

Determine where the observed KS distance (0.0987) falls within the distribution of bootstrap KS distances, for example as the fraction of bootstrap values that are at least 0.0987. Large KS distances indicate a poor fit, so only the upper tail matters: if the observed value lies far in the upper tail, the data agree poorly with the assumed normal distribution.
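The whole simulation could be organized along the following lines. This is a sketch under stated assumptions, not a definitive implementation: the function names `bootstrap_ks`, `upper_tail_fraction`, and `run_exercise` are hypothetical, the bootstrap datasets are generated from the normal distribution with the estimated parameters, and no simulation results are claimed here.

```python
import random
from statistics import NormalDist, mean, stdev


def ks_distance_to_fitted_normal(sample):
    """KS distance between the sample's EDF and the normal CDF fitted to that sample."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    distance = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        distance = max(distance, i / n - f, f - (i - 1) / n)
    return distance


def bootstrap_ks(mu_hat, sigma_hat, n, repetitions, seed=0):
    """Return one KS distance per bootstrap dataset drawn from N(mu_hat, sigma_hat^2)."""
    rng = random.Random(seed)
    values = []
    for _ in range(repetitions):
        sample = [rng.gauss(mu_hat, sigma_hat) for _ in range(n)]
        values.append(ks_distance_to_fitted_normal(sample))
    return values


def upper_tail_fraction(boot_values, observed):
    """Fraction of bootstrap KS distances that are at least as large as the observed one."""
    return sum(v >= observed for v in boot_values) / len(boot_values)


def run_exercise():
    """Full-size run for this exercise; in pure Python this can take a while."""
    boot = bootstrap_ks(39.85, 2.09, n=5732, repetitions=1000)
    return upper_tail_fraction(boot, 0.0987)
```

Calling `run_exercise()` returns the fraction of the 1000 bootstrap KS distances that reach or exceed 0.0987. A small fraction means that 0.0987 is unusually large for data that really are normal.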

Key Concepts

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov (KS) test is a non-parametric test that helps us determine whether a sample comes from a specified distribution. It's particularly useful for comparing a sample distribution to a normal distribution, which is what we are doing with the Scottish soldiers' chest circumferences. The test measures the maximum distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution.

This distance is called the KS statistic. Here it equals 0.0987, the largest vertical gap between the empirical distribution function and the fitted normal distribution function. The statistic is usually interpreted by comparing it with critical values, but the standard KS critical values assume a fully specified reference distribution; because $\mu$ and $\sigma$ are estimated from the data here, a bootstrap simulation is used instead.

The bootstrap simulation allows us to see how this KS statistic behaves and whether it supports the assumption that the data follows a normal distribution.

Normal distribution

The normal distribution is a common distribution in statistics, often referred to as a "bell curve" due to its shape. It is characterized by two parameters: the mean ($\mu$) and the standard deviation ($\sigma$). In our exercise, the normal distribution is used as a reference point for determining the normality of the soldiers' chest circumferences.

The mean is the "center" of the distribution, and the standard deviation measures the spread or "width" of the distribution. A key property of the normal distribution is symmetry around the mean. This means that values are evenly distributed around the mean, with a predictable pattern: about 68% of values fall within one standard deviation of the mean.
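The 68% figure can be checked numerically. The short sketch below uses `statistics.NormalDist` from the Python standard library, with the estimated parameters from the exercise as an example; the variable names are chosen for illustration only.

```python
from statistics import NormalDist

# Normal distribution with the estimated parameters from the exercise.
fitted = NormalDist(mu=39.85, sigma=2.09)

# Probability of falling within one standard deviation of the mean.
within_one_sd = fitted.cdf(39.85 + 2.09) - fitted.cdf(39.85 - 2.09)
print(round(within_one_sd, 4))  # 0.6827
```

The same value results for any choice of $\mu$ and $\sigma$, since it depends only on the distance from the mean measured in standard deviations.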

When performing statistical tests like the KS test, the normal distribution provides a model to compare against the empirical distribution of the sample data. Determining whether the empirical data differ significantly from this model can help validate assumptions about the underlying population distribution.

Empirical distribution function

The empirical distribution function (EDF) of a sample is a step function that estimates the cumulative distribution function (CDF) of the distribution from which the sample was drawn. It provides a way to assess how sample data is distributed.

For each value in the sample set, the EDF gives the proportion of sample values less than or equal to that value. This is plotted as a step function which increases by $1/n$ with each sample data point, where $n$ is the number of observations.
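A minimal sketch of this definition in Python follows. The helper name `edf` and the four-point toy dataset are made up for illustration. Note that a value occurring twice makes the function jump by $2/n$ at that point.

```python
def edf(sample):
    """Return the empirical distribution function F_n of the sample."""
    n = len(sample)

    def F_n(a):
        # Proportion of sample values less than or equal to a.
        return sum(x <= a for x in sample) / n

    return F_n


F_n = edf([1, 2, 4, 4])  # toy dataset, n = 4
print(F_n(0), F_n(1), F_n(3), F_n(4))  # 0.0 0.25 0.5 1.0
```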

In the KS test, the EDF of the sample is compared against the CDF of a theoretical distribution, such as the normal distribution, to see if there are any significant discrepancies. Large discrepancies indicate differences between the sample distribution and the theoretical model. Understanding the EDF is crucial in performing and interpreting results from the KS test, as it forms the basis of the observed differences.

Sample mean and standard deviation

The sample mean and standard deviation are essential statistics in summarizing data and estimating parameters for a normal distribution.

The sample mean $\bar{x}$ is an estimate of the population mean $\mu$, calculated by averaging all the sample observations. It gives us an idea of the "central tendency" of the data, showing us where most data points are likely to be found.

The sample standard deviation $s$ measures the "spread" of the data around the mean. It reflects how much the individual data points deviate from the mean, and is foundational in calculating the variability of a dataset.
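For reference, the usual formulas for a dataset $x_{1}, x_{2}, \ldots, x_{n}$ are given below, assuming the common convention in which the sample standard deviation uses the divisor $n-1$:

$$ \bar{x}_{n}=\frac{1}{n} \sum_{i=1}^{n} x_{i}, \qquad s_{n}=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}_{n}\right)^{2}} . $$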

In the bootstrap simulation, these parameters are recalculated for each simulated dataset, just as they were estimated from the original data. The resulting KS distances show how much the statistic varies when the data really are normal, which is the yardstick against which 0.0987 is judged.
