Problem 4

Question

The ordered software data (see also Table 15.3) are given in the following list.
$$ \begin{array}{rrrrrrrrrr}
0 & 0 & 0 & 2 & 4 & 6 & 8 & 9 & 10 & 10 \\
10 & 12 & 15 & 15 & 16 & 21 & 22 & 24 & 26 & 30 \\
30 & 31 & 33 & 36 & 44 & 50 & 55 & 58 & 65 & 68 \\
75 & 77 & 79 & 81 & 88 & 91 & 97 & 100 & 108 & 108 \\
112 & 113 & 114 & 115 & 120 & 122 & 129 & 134 & 138 & 143 \\
148 & 160 & 176 & 180 & 193 & 193 & 197 & 227 & 232 & 233 \\
236 & 242 & 245 & 255 & 261 & 263 & 281 & 290 & 296 & 300 \\
300 & 325 & 330 & 357 & 365 & 369 & 371 & 379 & 386 & 422 \\
445 & 446 & 447 & 452 & 457 & 482 & 529 & 529 & 543 & 600 \\
648 & 670 & 700 & 707 & 724 & 729 & 748 & 790 & 810 & 816 \\
828 & 843 & 860 & 865 & 868 & 875 & 943 & 948 & 983 & 990 \\
1011 & 1045 & 1064 & 1071 & 1082 & 1146 & 1160 & 1222 & 1247 & 1351 \\
1435 & 1461 & 1755 & 1783 & 1800 & 1864 & 1897 & 2323 & 2930 & 3110 \\
3321 & 4116 & 5485 & 5509 & 6150 & & & & &
\end{array} $$
a. Compute the heights on each bin of the histogram with bins $[0,500]$, $(500,1000]$, and so on.
b. Compute the value of the empirical distribution function in the endpoints of the bins.
c. Check that the area under the histogram on bin $(1000,1500]$ is equal to the increase $F_{n}(1500)-F_{n}(1000)$ of the empirical distribution function on this bin. Actually, this is true for each single bin (see Exercise 15.11).

Step-by-Step Solution

Answer

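The data set has $n = 135$ values and every bin has width 500, so each histogram height is (bin count)$/(135 \cdot 500)$; for example, the bin $[0,500]$ holds 86 values and has height $86/67500 \approx 0.00127$. The empirical distribution function at a bin endpoint is the cumulative count divided by 135, e.g. $F_n(500) = 86/135 \approx 0.637$. On $(1000,1500]$ the area under the histogram and the increase $F_n(1500) - F_n(1000)$ both equal $12/135 \approx 0.089$.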

Step 1: Organize data into bins

Identify the range of the data and assign each data point to its bin. The values run from 0 to 6150, so 13 bins of width 500 are needed:
- Bin 1: $[0, 500]$
- Bin 2: $(500, 1000]$
- Bin 3: $(1000, 1500]$
- Continue in the same way up to Bin 13: $(6000, 6500]$.

Step 2: Count data points in each bin

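For each bin, count the data points that fall in it by scanning through the ordered list. Every bin except the first excludes its left endpoint and includes its right endpoint:
- Bin 1, $[0, 500]$: 86 values.
- Bin 2, $(500, 1000]$: 24 values.
- Bin 3, $(1000, 1500]$: 12 values.
- Bin 4, $(1500, 2000]$: 5 values.
- The remaining bins, $(2000, 2500]$ through $(6000, 6500]$, contain 1, 1, 2, 0, 1, 0, 1, 1, 1 values.
The counts add up to $n = 135$, the total number of data points.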

Step 3: Calculate histogram heights

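The height of the histogram on a bin is the number of data points in the bin divided by the total number of data points times the bin width, so that the area of each bar equals the fraction of the data in that bin. Here $n = 135$ and every bin has width 500, so the height is the count divided by $135 \cdot 500 = 67500$:
- $[0, 500]$: $86/67500 \approx 0.001274$
- $(500, 1000]$: $24/67500 \approx 0.000356$
- $(1000, 1500]$: $12/67500 \approx 0.000178$
- $(1500, 2000]$: $5/67500 \approx 0.000074$
- Bins with 1 data point have height $1/67500 \approx 0.000015$, the bin $(3000, 3500]$ with 2 points has height $\approx 0.000030$, and the empty bins $(3500, 4000]$ and $(4500, 5000]$ have height 0.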
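The counting and scaling in Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal sketch rather than a definitive implementation: it takes the height to be the count divided by $n$ times the bin width, and the names `WIDTH`, `bin_index`, `counts`, and `heights` are chosen here for illustration.

```python
import math

# Ordered software data from the question.
data = [
    0, 0, 0, 2, 4, 6, 8, 9, 10, 10,
    10, 12, 15, 15, 16, 21, 22, 24, 26, 30,
    30, 31, 33, 36, 44, 50, 55, 58, 65, 68,
    75, 77, 79, 81, 88, 91, 97, 100, 108, 108,
    112, 113, 114, 115, 120, 122, 129, 134, 138, 143,
    148, 160, 176, 180, 193, 193, 197, 227, 232, 233,
    236, 242, 245, 255, 261, 263, 281, 290, 296, 300,
    300, 325, 330, 357, 365, 369, 371, 379, 386, 422,
    445, 446, 447, 452, 457, 482, 529, 529, 543, 600,
    648, 670, 700, 707, 724, 729, 748, 790, 810, 816,
    828, 843, 860, 865, 868, 875, 943, 948, 983, 990,
    1011, 1045, 1064, 1071, 1082, 1146, 1160, 1222, 1247, 1351,
    1435, 1461, 1755, 1783, 1800, 1864, 1897, 2323, 2930, 3110,
    3321, 4116, 5485, 5509, 6150,
]

WIDTH = 500
n = len(data)
num_bins = math.ceil(max(data) / WIDTH)  # enough bins to cover the maximum


def bin_index(x):
    """Bin 0 is [0, 500]; bin k >= 1 is (500*k, 500*(k+1)]."""
    return max(0, math.ceil(x / WIDTH) - 1)


counts = [0] * num_bins
for x in data:
    counts[bin_index(x)] += 1

# Height = count / (n * width), so each bar's area is count / n.
heights = [c / (n * WIDTH) for c in counts]

for k, (c, h) in enumerate(zip(counts, heights)):
    left_bracket = "[" if k == 0 else "("
    print(f"{left_bracket}{k * WIDTH}, {(k + 1) * WIDTH}]: count={c}, height={h:.7f}")
```

The `max(0, ...)` in `bin_index` places the value 0 in the first bin, which is the only bin that is closed on the left.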

Step 4: Compute empirical distribution function values

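The empirical distribution function is $F_n(x) = (\text{number of data points} \le x)/n$. At the right endpoint of a bin it equals the cumulative count of all bins up to that endpoint, divided by 135:
- $F_n(0) = 3/135 \approx 0.0222$, because three observations equal 0.
- $F_n(500) = 86/135 \approx 0.6370$
- $F_n(1000) = 110/135 \approx 0.8148$
- $F_n(1500) = 122/135 \approx 0.9037$
- $F_n(2000) = 127/135 \approx 0.9407$
- Continuing, $F_n$ takes the values 0.9481, 0.9556, 0.9704, 0.9704, 0.9778, 0.9778, 0.9852, 0.9926 at $2500, 3000, \ldots, 6000$, and $F_n(6500) = 1$.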
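As a rough illustration, the endpoint values can be obtained as running totals of the bin counts. The counts below were tallied from the data in the question; the names `counts`, `endpoints`, and `ecdf_at` are chosen here for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import accumulate

# Bin counts for [0,500], (500,1000], ..., (6000,6500], tallied from the data.
counts = [86, 24, 12, 5, 1, 1, 2, 0, 1, 0, 1, 1, 1]
WIDTH = 500
n = sum(counts)  # total number of observations

# F_n at the right endpoint of a bin is (cumulative count) / n.
endpoints = [WIDTH * (k + 1) for k in range(len(counts))]
ecdf_at = {t: Fraction(c, n) for t, c in zip(endpoints, accumulate(counts))}

for t, f in ecdf_at.items():
    print(f"F_n({t}) = {float(f):.4f}")
```

The left endpoint 0 is not covered by this running total: $F_n(0)$ has to be read off the data directly, as the fraction of observations equal to 0.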

Step 5: Verify area under histogram for bin (1000,1500]

For the bin $(1000, 1500]$, compute the area under the histogram and compare it with the increase in $F_n$:
- The area is the height of the bin multiplied by its width: $\frac{12}{135 \cdot 500} \cdot 500 = \frac{12}{135} \approx 0.0889$.
- The increase of the empirical distribution function is $F_n(1500) - F_n(1000) = \frac{122}{135} - \frac{110}{135} = \frac{12}{135} \approx 0.0889$.
The two values agree, as claimed.
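The same cancellation works on any bin $(a, b]$, which is why the check is not special to $(1000, 1500]$. A short derivation, writing $c$ for the number of observations in $(a, b]$ (a symbol introduced here for illustration) and $x_1, \ldots, x_n$ for the data:

\[ \underbrace{\frac{c}{n\,(b-a)}}_{\text{height}} \cdot (b-a) = \frac{c}{n} = \frac{\#\{i : x_i \le b\} - \#\{i : x_i \le a\}}{n} = F_n(b) - F_n(a). \]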

Key Concepts

Histogram Analysis, Probability Density, Empirical Cumulative Distribution Function (ECDF)

Histogram Analysis

Histograms are a tool for visualizing the distribution of data over specified intervals, known as bins. Divide the span of the data into contiguous intervals and draw a bar over each interval whose area is proportional to the number of data points that fall into it. The bars then show how often different ranges of values occur in the data set.

Creating a histogram involves several steps:

Choose suitable bin ranges: The bins you select influence how the data distribution looks. In our problem, the bins are $[0, 500]$, $(500, 1000]$, and so on.
Count data points in each bin: For each bin, such as $[0, 500]$, count how many data points lie in the interval. This count shows how much of the data set falls in that interval.
Calculate the height of the bars: Each bin's height is the count of data points in the bin divided by the total number of data points times the bin width. This makes the height a probability density and the bar's area the fraction of the data in the bin.

The histogram thus not only shows how data is distributed but also provides a visual impression of the probability density across different value ranges.

Probability Density

Probability density describes distributions over continuous ranges, where probabilities of individual values are not as straightforward as for discrete data. A continuous random variable spreads its probability over intervals of real numbers, and any single point has probability zero.

In a histogram, each bar's height estimates the probability density on that bin. It tells us how 'densely packed' the data are in that section, per unit of length.

Formula Insight: The height of each histogram bar is the number of data points in the bin divided by the product of the total number of data points and the bin width. The density estimate on each bin is therefore
\[ \text{Height} = \frac{\text{Number of Data Points in Bin}}{\text{Total Data Points} \times \text{Bin Width}} \]
Total area: With this scaling, the area of each bar is the fraction of the data in its bin, so the areas of all bars add up to 1, just as a probability density integrates to 1 over all possible outcomes.
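A small sketch can make the role of the bin width concrete by comparing two scalings of the bar heights for this problem: the relative frequency (count divided by $n$) and the density (count divided by $n$ times the width). The bin counts are those tallied from the data in the question, and the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Bin counts for [0,500], (500,1000], ..., (6000,6500], tallied from the data.
counts = [86, 24, 12, 5, 1, 1, 2, 0, 1, 0, 1, 1, 1]
WIDTH = 500
n = sum(counts)

rel_freq = [Fraction(c, n) for c in counts]          # count / n
density = [Fraction(c, n * WIDTH) for c in counts]   # count / (n * width)

# Total area under the bars for each choice of height.
area_rel_freq = sum(r * WIDTH for r in rel_freq)
area_density = sum(d * WIDTH for d in density)

print(area_rel_freq)  # 500: relative frequencies used as heights do not give area 1
print(area_density)   # 1: density heights give total area 1
```

Only the density scaling yields a total area of 1, which is what allows a bar's area to be read as the fraction of data in its bin.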

Understanding probability density lets us infer how likely we are to find observations in specific ranges and gives foundational insight into how our data are spread over different intervals.

Empirical Cumulative Distribution Function (ECDF)

The ECDF shows how data values accumulate over a data set. Think of it as a running total of probabilities as you move through the ordered data values. Starting from 0, you add $1/n$ for each successive value, reaching 1 at the data set's maximum value.

Here's how the ECDF translates within our context:

Starting point: The ECDF is 0 for any value lower than the minimum data value.
Cumulative calculation: Progressively add the fraction of data points in each new bin. At any bin endpoint, the ECDF gives the fraction of data points at or below that endpoint.
Endpoint values: At bin thresholds such as 500, 1000, etc., the ECDF value is the portion of the data less than or equal to that threshold.
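The definition behind these points can be sketched as a short function that counts observations at or below a threshold. This is an illustrative sketch, not a library routine: the function name `ecdf` is chosen here, and the toy sample is the ten smallest values of the software data.

```python
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Fraction of observations <= x; sorted_sample must be in ascending order."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


# Toy sample: the ten smallest values of the software data.
sample = [0, 0, 0, 2, 4, 6, 8, 9, 10, 10]

print(ecdf(sample, -1))  # 0.0: below the minimum
print(ecdf(sample, 0))   # 0.3: three of the ten values equal 0
print(ecdf(sample, 5))   # 0.5: five values are at most 5
print(ecdf(sample, 10))  # 1.0: at the maximum
```

`bisect_right` returns the position just after the last element equal to `x`, which is exactly the number of observations less than or equal to `x` in a sorted list.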

Because it accumulates, the ECDF shows not only how the data are distributed but also how much of the data lies at or below any given value, across the whole observed range.
