# Histogram

## Description

A histogram is a graph of the distribution of a data set, designed to show its centering, dispersion (spread), and shape (relative frequency). Histograms give a visual display of large amounts of data that are difficult to understand in tabular or spreadsheet form.

They are used to understand how the output of a process relates to customer expectations (targets and specifications), and help answer the question: "Is the process capable of meeting customer requirements?"

## Example

To understand the application of histograms, consider a simple example: height data were collected from a training class of \$$50\$$ individuals, as shown in the following table:

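Individual Height, Measured in Inches

|      |      |      |      |      |
|------|------|------|------|------|
| 69.9 | 68.9 | 68.2 | 66.0 | 71.0 |
| 69.0 | 70.0 | 68.5 | 66.5 | 72.5 |
| 69.6 | 69.5 | 70.0 | 67.5 | 73.0 |
| 68.5 | 70.4 | 66.8 | 68.3 | 69.0 |
| 65.0 | 71.1 | 69.0 | 68.2 | 71.3 |
| 65.9 | 71.0 | 69.3 | 69.1 | 68.2 |
| 67.2 | 72.5 | 69.1 | 70.2 | 68.5 |
| 67.5 | 73.1 | 69.4 | 69.5 | 70.0 |
| 68.0 | 68.8 | 68.5 | 70.5 | 67.0 |
| 68.6 | 71.3 | 65.5 | 70.8 | 69.2 |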

Even with only \$$50\$$ measurements, it is difficult to draw specific conclusions from the raw table without further analysis. A histogram can be constructed to provide more usable information.
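As a rough stand-in for a chart, the sketch below tallies the 50 heights into whole-inch bins and prints one row of X marks per bin. The names `HEIGHTS` and `text_histogram` are illustrative, and the one-inch bin width is an assumption chosen for simplicity, not a recommendation.

```python
import math
from collections import Counter

# Heights in inches, copied from the table above (50 values).
HEIGHTS = [
    69.9, 68.9, 68.2, 66.0, 71.0,
    69.0, 70.0, 68.5, 66.5, 72.5,
    69.6, 69.5, 70.0, 67.5, 73.0,
    68.5, 70.4, 66.8, 68.3, 69.0,
    65.0, 71.1, 69.0, 68.2, 71.3,
    65.9, 71.0, 69.3, 69.1, 68.2,
    67.2, 72.5, 69.1, 70.2, 68.5,
    67.5, 73.1, 69.4, 69.5, 70.0,
    68.0, 68.8, 68.5, 70.5, 67.0,
    68.6, 71.3, 65.5, 70.8, 69.2,
]


def text_histogram(values):
    """Return one text row per whole-inch bin: '<inch> | <X marks> <count>'."""
    # Assumption: each value is binned by its whole-inch part (69.9 -> 69).
    counts = Counter(math.floor(v) for v in values)
    rows = []
    for inch in range(min(counts), max(counts) + 1):
        n = counts[inch]  # Counter returns 0 for bins with no values
        rows.append(f"{inch} | {'X' * n} {n}")
    return rows


print("\n".join(text_histogram(HEIGHTS)))
```

With these data the longest rows are the 68-inch and 69-inch bins, which hold 12 values each.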

The histogram gives a quick visual summary of the data. It is easy to see that the average height is around \$$69\$$ inches, with few people shorter than \$$66\$$ inches, and few as tall as \$$72\$$ inches. The frequency distribution also indicates how likely a given height is.

If this sample were representative of the overall population of employees (that is, randomly drawn) and large enough, we could conclude that the probability of finding employees taller than \$$73\$$ inches or shorter than \$$65\$$ inches is low.
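The figures quoted above can be checked directly from the raw data. The following is a minimal sketch, assuming the 50 heights from the table; `share_outside` is a hypothetical helper name, and the sample share it returns is only a crude estimate of the population probability.

```python
from statistics import mean

# Heights in inches, copied from the table in this section (50 values).
HEIGHTS = [
    69.9, 68.9, 68.2, 66.0, 71.0,
    69.0, 70.0, 68.5, 66.5, 72.5,
    69.6, 69.5, 70.0, 67.5, 73.0,
    68.5, 70.4, 66.8, 68.3, 69.0,
    65.0, 71.1, 69.0, 68.2, 71.3,
    65.9, 71.0, 69.3, 69.1, 68.2,
    67.2, 72.5, 69.1, 70.2, 68.5,
    67.5, 73.1, 69.4, 69.5, 70.0,
    68.0, 68.8, 68.5, 70.5, 67.0,
    68.6, 71.3, 65.5, 70.8, 69.2,
]


def share_outside(values, low, high):
    """Fraction of values strictly below `low` or strictly above `high`."""
    outside = [v for v in values if v < low or v > high]
    return len(outside) / len(values)


print(f"mean height: {mean(HEIGHTS):.1f} in")
print(f"share under 65 in or over 73 in: {share_outside(HEIGHTS, 65, 73):.2f}")
```

Only one of the 50 values (73.1) lies strictly outside the 65 to 73 inch band, so the sample share is 0.02.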

This distribution has the shape of a "normal distribution", a bell-shaped curve that is common in both nature and industrial settings. The histogram below is overlaid with a normal curve.
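A normal curve for an overlay can be derived from the sample mean and standard deviation. As a numeric stand-in for the visual comparison, the sketch below fits a normal distribution to the height data and compares the share of observations within one standard deviation of the mean to the share the fitted curve predicts. The function name `normal_vs_observed` and the choice of a one-standard-deviation band are assumptions made for illustration; this is a rough check, not a formal normality test.

```python
from statistics import NormalDist, mean, stdev

# Heights in inches, copied from the table in this section (50 values).
HEIGHTS = [
    69.9, 68.9, 68.2, 66.0, 71.0,
    69.0, 70.0, 68.5, 66.5, 72.5,
    69.6, 69.5, 70.0, 67.5, 73.0,
    68.5, 70.4, 66.8, 68.3, 69.0,
    65.0, 71.1, 69.0, 68.2, 71.3,
    65.9, 71.0, 69.3, 69.1, 68.2,
    67.2, 72.5, 69.1, 70.2, 68.5,
    67.5, 73.1, 69.4, 69.5, 70.0,
    68.0, 68.8, 68.5, 70.5, 67.0,
    68.6, 71.3, 65.5, 70.8, 69.2,
]


def normal_vs_observed(values, k=1.0):
    """Return (observed, expected) shares within k standard deviations of the mean.

    `expected` comes from a normal distribution fitted with the sample mean
    and the sample standard deviation.
    """
    mu, sigma = mean(values), stdev(values)
    observed = sum(1 for v in values if abs(v - mu) <= k * sigma) / len(values)
    fitted = NormalDist(mu, sigma)
    expected = fitted.cdf(mu + k * sigma) - fitted.cdf(mu - k * sigma)
    return observed, expected


observed, expected = normal_vs_observed(HEIGHTS)
print(f"within 1 standard deviation: observed {observed:.2f}, normal curve {expected:.2f}")
```

For the height data the two shares are close (about 0.66 observed against about 0.68 for the fitted curve), which is consistent with a roughly bell-shaped distribution.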

There are other distribution shapes that you may encounter:

## How to Start

The first step in constructing a histogram is to decide how the process should be measured, that is, what data should be collected. The data must be variable data, meaning data measured on a continuous scale, such as volume, size, weight, time, or temperature.

Next, gather the data. As a rule of thumb, over \$$50\$$ data points should be collected in order to see meaningful patterns. You can use historical information to establish a baseline (if the measurement method was exactly the same), and you may wish to compare samples drawn from different shifts or time periods.

Once you have gathered the data, put them into tabular form, such as a spreadsheet. You can then construct a histogram by several methods. The preferred method is to use a statistical software package; virtually all of them accept data copied from a spreadsheet.

You can also use the charting function of your spreadsheet program, but you may need to organize the data and calculate the charting intervals. If you choose this route, use the following sequence:

1. Count the number of data points (\$$50\$$ in our height example).
2. Determine the range of the sample, which is the difference between the highest and lowest values (\$$73.1-65.0\$$, or \$$8.1\$$ inches in our height example).
3. Determine the number of class intervals. You can use either of two methods as general guidelines:
   - A. Use \$$10\$$ intervals as a rule of thumb.
   - B. Calculate the square root of the number of data points and round to the nearest whole number. In our height example, the square root of \$$50\$$ is \$$7.07\$$, or \$$7\$$ when rounded.

   You may wish to experiment with different numbers of intervals. If there are too many, the distribution will spread out and the histogram will look flat. Conversely, if there are too few, the distribution can look artificially tight.
4. Determine the class interval width by one of two methods:
   - A. \$$\text{Width} = \dfrac{\text{Range}}{\# \text{of Intervals}} = \dfrac{8.1}{10} = 0.81\$$
   - B. Divide the standard deviation by three. The height data have a standard deviation of \$$1.85\$$, which yields a class interval width of \$$0.62\$$ inches and therefore a total of \$$14\$$ class intervals (the range of \$$8.1\$$ divided by \$$0.62\$$, rounded up). This is more class intervals than either rule of thumb in step 3 indicated.
5. Develop a table or spreadsheet with the frequency (count) for each interval, which becomes a tabular histogram:
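
| Class | Height Intervals (inches) | Frequency | Total |
|-------|---------------------------|-----------|-------|
| 1 | 64.4 - 65.0 | X | 1 |
| 2 | 65.1 - 65.7 | X | 1 |
| 3 | 65.8 - 66.4 | XX | 2 |
| 4 | 66.5 - 67.1 | XXX | 3 |
| 5 | 67.2 - 67.8 | XXX | 3 |
| 6 | 67.9 - 68.5 | XXXXXXXXX | 9 |
| 7 | 68.6 - 69.2 | XXXXXXXXX | 9 |
| 8 | 69.3 - 69.9 | XXXXXX | 6 |
| 9 | 70.0 - 70.6 | XXXXXX | 6 |
| 10 | 70.7 - 71.3 | XXXXXX | 6 |
| 11 | 71.4 - 72.0 | | 0 |
| 12 | 72.1 - 72.7 | XX | 2 |
| 13 | 72.8 - 73.4 | XX | 2 |
| 14 | 73.5 - 74.1 | | 0 |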
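The five steps above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the heights are the 50 values from the Example section, the `tally` helper is a hypothetical name, and its defaults (first interval starting at 64.4, intervals 0.7 inches apart, 14 classes) are read off the interval column of the table rather than computed. The helper works in whole tenths of an inch, which assumes the data are recorded to one decimal place.

```python
import math
from statistics import stdev

# Heights in inches, copied from the Example section (50 values).
HEIGHTS = [
    69.9, 68.9, 68.2, 66.0, 71.0,
    69.0, 70.0, 68.5, 66.5, 72.5,
    69.6, 69.5, 70.0, 67.5, 73.0,
    68.5, 70.4, 66.8, 68.3, 69.0,
    65.0, 71.1, 69.0, 68.2, 71.3,
    65.9, 71.0, 69.3, 69.1, 68.2,
    67.2, 72.5, 69.1, 70.2, 68.5,
    67.5, 73.1, 69.4, 69.5, 70.0,
    68.0, 68.8, 68.5, 70.5, 67.0,
    68.6, 71.3, 65.5, 70.8, 69.2,
]

n = len(HEIGHTS)                                     # step 1: number of data points
data_range = round(max(HEIGHTS) - min(HEIGHTS), 1)   # step 2: highest minus lowest
intervals_sqrt = round(math.sqrt(n))                 # step 3, method B: square-root rule
width_a = data_range / 10                            # step 4, method A: range / 10 intervals
width_b = round(stdev(HEIGHTS) / 3, 2)               # step 4, method B: std. deviation / 3
intervals_b = math.ceil(data_range / width_b)        # intervals implied by method B


def tally(values, first_lower=64.4, width=0.7, classes=14):
    """Step 5: count the values that fall in each class interval.

    Arithmetic is done in integer tenths of an inch to avoid
    floating-point surprises at the interval boundaries.
    """
    start = round(first_lower * 10)
    step = round(width * 10)
    counts = [0] * classes
    for v in values:
        index = (round(v * 10) - start) // step
        if not 0 <= index < classes:
            raise ValueError(f"{v} falls outside the class intervals")
        counts[index] += 1
    return counts


print(n, data_range, intervals_sqrt, width_b, intervals_b)
for number, count in enumerate(tally(HEIGHTS), start=1):
    print(f"{number:2d} {'X' * count} {count}")
```

Changing `width` and `classes` in the call to `tally` is a quick way to experiment with how the number of intervals alters the apparent shape of the distribution.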

Once the histogram is developed, you can analyze the data with regard to customer expectations (specifications). You can see from the following graphic that the first histogram of a process sample falls within the specifications, while the second has a portion of the histogram outside of the specifications.

The second histogram has too much dispersion, or variability, to meet customer expectations. The indication is that action must be taken to make the output more consistent, or some number of defects will be produced.
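The same comparison can be made numerically by counting how many observations fall outside the specification limits. The sketch below reuses the height data from the Example section purely as sample input; the limits of 66 and 72 inches are hypothetical values chosen for illustration, not specifications given in this document, and `outside_spec` is an illustrative helper name.

```python
# Heights in inches, copied from the Example section (50 values).
HEIGHTS = [
    69.9, 68.9, 68.2, 66.0, 71.0,
    69.0, 70.0, 68.5, 66.5, 72.5,
    69.6, 69.5, 70.0, 67.5, 73.0,
    68.5, 70.4, 66.8, 68.3, 69.0,
    65.0, 71.1, 69.0, 68.2, 71.3,
    65.9, 71.0, 69.3, 69.1, 68.2,
    67.2, 72.5, 69.1, 70.2, 68.5,
    67.5, 73.1, 69.4, 69.5, 70.0,
    68.0, 68.8, 68.5, 70.5, 67.0,
    68.6, 71.3, 65.5, 70.8, 69.2,
]

# Hypothetical lower and upper specification limits (assumptions).
LSL, USL = 66.0, 72.0


def outside_spec(values, lsl, usl):
    """Return (count below the lower limit, count above the upper limit)."""
    below = sum(1 for v in values if v < lsl)
    above = sum(1 for v in values if v > usl)
    return below, above


below, above = outside_spec(HEIGHTS, LSL, USL)
share = (below + above) / len(HEIGHTS)
print(f"below LSL: {below}, above USL: {above}, share outside: {share:.0%}")
```

With these assumed limits, 3 values fall below the lower limit and 4 above the upper limit, the numeric counterpart of a histogram whose tails extend past the specification lines.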

A more advanced form of this analysis is the Cp metric, which is covered in the Process Capability section of the Statistical Process Control module within the Toolbox.

After assessing dispersion, or process spread, you can also analyze process centering. A process output distribution that is narrow enough to fit between the upper and lower specifications must also be centered in order to actually fall between them. It is often much easier to center a process than to reduce its spread, or dispersion.

Centering may be a function of machine or tool settings, whereas the reduction of variability may require multiple actions to address multiple root causes.

The degree to which a stable process is both centered and within specifications is reflected by a metric called Cpk, which is also covered in the Statistical Process Control module of the Toolbox. Assessment of Cpk requires the collection of data over time to demonstrate statistical control, or stability.
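For orientation, the commonly used textbook definitions are Cp = (USL - LSL) / (6 x sigma) and Cpk = min(USL - mean, mean - LSL) / (3 x sigma). The sketch below applies them to the height data from the Example section. It rests on several assumptions: the limits of 66 and 72 inches are hypothetical, the overall sample standard deviation is used as the estimate of sigma (a simplification), and no check of process stability is made, so the results are illustrative only.

```python
from statistics import mean, stdev

# Heights in inches, copied from the Example section (50 values).
HEIGHTS = [
    69.9, 68.9, 68.2, 66.0, 71.0,
    69.0, 70.0, 68.5, 66.5, 72.5,
    69.6, 69.5, 70.0, 67.5, 73.0,
    68.5, 70.4, 66.8, 68.3, 69.0,
    65.0, 71.1, 69.0, 68.2, 71.3,
    65.9, 71.0, 69.3, 69.1, 68.2,
    67.2, 72.5, 69.1, 70.2, 68.5,
    67.5, 73.1, 69.4, 69.5, 70.0,
    68.0, 68.8, 68.5, 70.5, 67.0,
    68.6, 71.3, 65.5, 70.8, 69.2,
]


def capability(values, lsl, usl):
    """Return (Cp, Cpk) using the overall sample standard deviation as sigma."""
    mu, sigma = mean(values), stdev(values)
    cp = (usl - lsl) / (6 * sigma)                 # spread only: ignores centering
    cpk = min(usl - mu, mu - lsl) / (3 * sigma)    # distance to the nearer limit
    return cp, cpk


# Hypothetical specification limits (assumptions, not from this document).
cp, cpk = capability(HEIGHTS, lsl=66.0, usl=72.0)
print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}")
```

Cpk equals Cp only when the mean sits exactly midway between the limits and is lower otherwise, which is why centering and spread are assessed separately.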

## Summary

The histogram is a common tool for understanding data and their characteristics. Knowing how to read a histogram correctly can greatly assist process improvement efforts. Because histograms are so widely used, they also make excellent graphics for presenting data.