IQR or standard deviation for skewed data

The median is very similar to the mean when the distribution of the data is symmetrical, and so occasionally can be used directly in meta-analyses.  However, means and medians can be very different from each other if the data are skewed, and medians are often reported because the data are skewed (see Chapter 9, Section 9.4.5.3).

Interquartile ranges describe where the central 50% of participants’ outcomes lie. When sample sizes are large and the distribution of the outcome is similar to the normal distribution, the width of the interquartile range will be approximately 1.35 standard deviations. In other situations, and especially when the outcome’s distribution is skewed, it is not possible to estimate a standard deviation from an interquartile range. Note that the use of interquartile ranges rather than standard deviations can often be taken as an indicator that the outcome’s distribution is skewed.
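The 1.35 factor can be checked against the standard normal distribution. The following Python sketch uses only the standard library; `sd_from_iqr` is a hypothetical helper name, and the conversion it performs is reasonable only under the large-sample, approximately normal conditions described above.

```python
from statistics import NormalDist

# Quartiles of the standard normal distribution (mean 0, SD 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)

# Width of the interquartile range, in units of one standard deviation.
iqr_in_sd_units = q3 - q1  # roughly 1.35


def sd_from_iqr(iqr):
    """Approximate an SD from an IQR, assuming a roughly normal outcome."""
    return iqr / iqr_in_sd_units


print(round(iqr_in_sd_units, 2))
```

For a skewed outcome this division should not be used, since the relationship between the IQR and the standard deviation no longer holds.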

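A distribution that is skewed to the right (bunched up toward the left with a "tail" stretching toward the right) typically has a mean larger than its median. Similarly, a distribution that is skewed to the left (bunched up toward the right with a "tail" stretching toward the left) typically has a mean smaller than its median. (See http://www.amstat.org/publications/jse/v13n2/vonhippel.html for discussion of exceptions.)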

(Note that for a symmetrical distribution, such as a normal distribution, the mean and median are the same.)

For a practical example (one I have often given my students):

Suppose a friend is considering moving to Austin and asks you what houses here typically cost. Would you tell her the mean or the median house price? Housing prices (in Austin, at least -- think of all those Dellionaires) are skewed to the right. Unless your friend is rich, the median housing price would be more useful than the mean housing price (which would be larger than the median, thanks to the Dellionaires' expensive houses).
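The effect is easy to see with a handful of numbers. The prices in this Python sketch are invented purely for illustration (they are not Austin data): most houses are modestly priced and two very expensive ones form the right tail.

```python
from statistics import mean, median

# Hypothetical sale prices in thousands of dollars (invented for illustration).
prices = [180, 195, 210, 220, 235, 250, 270, 300, 1500, 2400]

mean_price = mean(prices)      # pulled upward by the two expensive houses
median_price = median(prices)  # middle of the sorted prices

print(mean_price, median_price)
```

Here the mean (576) is more than twice the median (242.5), even though eight of the ten houses cost 300 or less.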


In fact, many distributions that occur in practical situations are skewed, not symmetric. (For some examples, see the Life is Lognormal! website.)

Implications for Applying Statistical Techniques

How do we work with skewed distributions when so many statistical techniques give information about the mean? First, note that most of these techniques assume that the random variable in question has a distribution that is normal. Many of these techniques are somewhat "robust" to departures from normality -- that is, they still give pretty accurate results if the random variable has a distribution that is not too far from normal. But many common statistical techniques are not valid for strongly skewed distributions. Two possible alternatives are:

I. Taking logarithms of the original variable.

Fortunately, many of the skewed random variables that arise in applications are lognormal. That means that the logarithm of the random variable is normal, and hence most common statistical techniques can be applied to the logarithm of the original variable. (With robust techniques, approximately lognormal distributions can also be handled by taking logarithms.) However, doing this may require some care in interpretation. There are three common routes to interpretation when dealing with logs of variables.

1. In many fields, it is common to work with the log of the original outcome variable, rather than the original variable. Thus one might do a hypothesis test for equality of the means of the logs of the variables. A difference in the means of the logs will tell you that the original distributions are different, which in some applications may answer the question of interest. 
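As a rough illustration of why taking logarithms helps, the Python sketch below simulates a lognormal variable and compares the gap between mean and median before and after the transformation. The seed, sample size, and distribution parameters are arbitrary assumptions chosen for the example.

```python
import math
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the sketch is reproducible

# Simulated lognormal variable: log(x) is normal with mean 0 and SD 1.
# These parameters are arbitrary choices for illustration.
x = [random.lognormvariate(0.0, 1.0) for _ in range(10_000)]
log_x = [math.log(v) for v in x]

# On the original scale the mean sits well above the median (right skew).
raw_gap = mean(x) - median(x)

# On the log scale the mean and median nearly coincide (symmetry).
log_gap = mean(log_x) - median(log_x)

print(round(raw_gap, 2), round(log_gap, 2))
```

Once the logs look roughly normal, techniques that assume normality can be applied to `log_x` rather than to `x`, with the interpretation caveats noted above.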

Many measurement variables evaluated in public health are more or less normally distributed, but some are not. Consider the frequency distribution showing the percentage of people who have 0 to 20 or more drinks per day as shown below.

[Figure: frequency distribution of the percentage of people who have 0 to 20 or more drinks per day]

This distribution is not normal; it is skewed to the right. In this situation, the mean and standard deviation are not appropriate parameters for characterizing this variable. Instead, one should use a median to characterize the central tendency and interquartile range to indicate variability.

The Median

The median is the middle value, i.e., the value at which half of the measurements are below that value and half are above. For the seven systolic blood pressure measurements listed below, the median value is 121 mm Hg.

100, 110, 114, 121, 130, 130, 160

To find the median, sort the values and take the middle value if the number of values is odd; if the number of values is even, the median is the average of the two middle values. However, it is easier to let Excel compute this, particularly with a large number of subjects.
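For readers who prefer code to a spreadsheet, the same rule can be sketched in Python. The function name `median_of` is a hypothetical helper; Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median_of(values):
    """Median by sorting: the middle value if the count is odd,
    the average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


systolic = [100, 110, 114, 121, 130, 130, 160]  # mm Hg, from the text
print(median_of(systolic))       # 121
print(median_of(systolic[:-1]))  # even count: (114 + 121) / 2 = 117.5
```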

Interquartile Range (IQR)

The figure below has a normal distribution for which we would use mean and standard deviation instead of median and IQR. However, for ease of illustration we will use this normal distribution to explain quartiles and interquartile range that would be used for a skewed distribution. (Note that the mean and median will be similar in a symmetrical distribution like this.)

[Figure: normal distribution divided into quartiles]

To figure out the quartiles and IQR manually, you would first rank the observations from smallest to greatest and then divide the data set into four equal parts. These are the quartiles, each of which has an equal (or nearly equal) number of observations. Half of the observations will be below the median, and half will be above.

The 1st quartile (Q1) has the lowest 25% of observations, defined by finding the middle value between the lowest and median values. The 4th quartile (Q4) has the highest 25% of observations and is defined by finding the middle value between the median and the highest value in the data set. The 2nd quartile (Q2) has the 25% between the 1st quartile and the median, and the 3rd quartile (Q3) has the 25% between the median and the 4th quartile. The interquartile range (IQR) is the range for the middle 50% of the data, i.e., between the top of Q1 and the top of Q3.

IQR=Q3-Q1
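The manual procedure can be sketched in Python as follows. This is one convention among several: it takes Q1 and Q3 as the medians of the lower and upper halves, and it assumes the overall median is left out of both halves when the number of observations is odd. Statistical packages may use other conventions and return slightly different quartiles. The function name `quartiles` is a hypothetical helper.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    Assumption: when the count is odd, the overall median is excluded
    from both halves. Other conventions give slightly different values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


systolic = [100, 110, 114, 121, 130, 130, 160]  # mm Hg, from the median example
q1, q2, q3 = quartiles(systolic)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For the seven blood pressure values this gives Q1 = 110, a median of 121, Q3 = 130, and an IQR of 20 mm Hg.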

Outliers

Outliers are extreme values. Data points are outliers if they meet either of these definitions:

  • For a more or less normal distribution, outliers are values more than 3 standard deviations above the mean or less than 3 standard deviations below the mean.
  • For non-normal distributions, outliers are:
    • values > Q3 + 1.5(IQR)
    • values < Q1 - 1.5(IQR)
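Both definitions can be sketched in Python. The lower fence, Q1 - 1.5(IQR), is the usual counterpart of the upper fence Q3 + 1.5(IQR). The quartile convention (medians of the lower and upper halves), the function names, and the final reading of 210 are all assumptions made for illustration.

```python
from statistics import mean, median, stdev


def three_sd_outliers(values):
    """Values more than 3 sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > 3 * s]


def iqr_outliers(values):
    """Values above Q3 + 1.5*IQR or below Q1 - 1.5*IQR."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    # Assumed quartile convention: medians of the lower and upper halves,
    # leaving out the overall median when the count is odd.
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical readings: the last value is invented to show a flagged point.
readings = [100, 110, 114, 121, 130, 130, 160, 210]
print(iqr_outliers(readings))       # [210]
print(three_sd_outliers(readings))  # []
```

The two rules can disagree: here the IQR rule flags 210, while the 3 standard deviation rule flags nothing, partly because the extreme value itself inflates the standard deviation in such a small sample.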

Box and Whisker Plots

A box and whisker plot is a way of summarizing skewed data. It gives a sense of the shape of the distribution, the central tendency, and the degree of variability. You will not have to make box and whisker plots for this course.


Example of Box and Whisker Plots Used for Comparison
Carl and Angela work in a computer store and want to compare the number of sales they made for the past 12 months.

In the past 12 months Angela sold
34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37
(Ordered: 1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57)

In the past 12 months Carl sold
51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13.
(Ordered: 6, 7, 13, 17, 20, 25, 39, 41, 43, 49, 51, 62)

After ordering the data, it can be summarized as follows:

[Figure: box and whisker plots comparing Angela's and Carl's monthly sales]
(Image adapted from Statistics Canada)

Summary: Carl’s highest and lowest sales are both higher than Angela’s, and Carl’s median sales figure is higher too (32 versus 26). Overall, Carl tended to sell more computers than Angela during the past year.
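The numbers behind such a plot are the five-number summary: minimum, Q1, median, Q3, and maximum. The Python sketch below computes it for both salespeople. It assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data; the original figure may have used a different quartile convention, so its box edges could differ slightly. `five_number_summary` is a hypothetical helper name.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) with Q1/Q3 as medians of the halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


angela = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
carl = [51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13]

angela_summary = five_number_summary(angela)
carl_summary = five_number_summary(carl)
print("Angela:", angela_summary)
print("Carl:  ", carl_summary)
```

Under this convention Angela's summary is (1, 17, 26, 42, 57) and Carl's is (6, 15, 32, 46, 62).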

Can standard deviation be used on skewed data?

If data have a very skewed distribution, then the standard deviation will be grossly inflated, and is not a good measure of variability to use.
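A small Python sketch shows the inflation. The second data set is hypothetical: it is the same as the first except that the largest value is stretched into a long right tail. The IQR is computed with the median-of-halves convention, which is an assumption; other quartile conventions would give similar results here.

```python
from statistics import median, stdev


def iqr(values):
    """IQR with Q1/Q3 taken as medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(upper) - median(lower)


# Hypothetical data: identical except for the largest value.
original = [100, 110, 114, 121, 130, 130, 160]
long_tail = [100, 110, 114, 121, 130, 130, 300]

sd_before, sd_after = stdev(original), stdev(long_tail)
iqr_before, iqr_after = iqr(original), iqr(long_tail)
print(round(sd_before, 1), round(sd_after, 1))
print(iqr_before, iqr_after)
```

The standard deviation more than triples when the single value moves, while the IQR stays at 20 in both cases.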

What is the best measure for a skewed distribution?

In a skewed distribution, the median is often a preferred measure of central tendency, as the mean is not usually in the middle of the distribution.

Is standard deviation or IQR a better measure of dispersion?

The standard deviation (s) is generally considered a better measure of dispersion than the range and the IQR because, unlike them, it uses all the values in the data set in its calculation. The square of the standard deviation is called the variance (s²).

Does IQR show skewness?

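The IQR can be used to identify outliers (see above). The IQR may also indicate the skewness of the dataset: if the median sits much closer to one quartile than to the other, the distribution is likely skewed. The quartile deviation, or semi-interquartile range, is defined as half the IQR.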