IQR or standard deviation for skewed data

The median is very similar to the mean when the distribution of the data is symmetrical, and so can occasionally be used directly in meta-analyses. However, means and medians can be very different from each other if the data are skewed, and medians are often reported because the data are skewed [see Chapter 9, Section 9.4.5.3].

Interquartile ranges describe where the central 50% of participants’ outcomes lie. When sample sizes are large and the distribution of the outcome is similar to the normal distribution, the width of the interquartile range will be approximately 1.35 standard deviations. In other situations, and especially when the outcome's distribution is skewed, it is not possible to estimate a standard deviation from an interquartile range. Note that the use of interquartile ranges rather than standard deviations can often be taken as an indicator that the outcome's distribution is skewed.
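As a rough illustration of where the 1.35 figure comes from, the sketch below uses Python's standard-library statistics.NormalDist to compute the width of the interquartile range of a standard normal distribution, and then converts a reported interquartile range into an approximate standard deviation. The function name sd_from_iqr and the example quartiles are illustrative assumptions, and the conversion is only reasonable when the sample is large and the outcome is roughly normal.

```python
from statistics import NormalDist

# The 75th percentile of a standard normal distribution is about 0.674,
# so the IQR spans roughly 2 * 0.674 = 1.35 standard deviations.
z75 = NormalDist().inv_cdf(0.75)
IQR_WIDTH_IN_SD = 2 * z75


def sd_from_iqr(q1, q3):
    """Approximate the SD from quartiles, ASSUMING a roughly normal outcome."""
    return (q3 - q1) / IQR_WIDTH_IN_SD


# Hypothetical reported quartiles (not taken from any study).
print(round(IQR_WIDTH_IN_SD, 3))
print(round(sd_from_iqr(10.0, 23.5), 2))
```

If the reported quartiles sit very unevenly around the median, that is a hint that the distribution is skewed and that this conversion should not be used.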

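A distribution that is skewed to the right [bunched up toward the left with a "tail" stretching toward the right] typically has a mean larger than its median. Similarly, a distribution that is skewed to the left [bunched up toward the right with a "tail" stretching toward the left] typically has a mean smaller than its median. [See //www.amstat.org/publications/jse/v13n2/vonhippel.html for discussion of exceptions.]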

[Note that for a symmetrical distribution, such as a normal distribution, the mean and median are the same.]

For a practical example [one I have often given my students]:

Suppose a friend is considering moving to Austin and asks you what houses here typically cost. Would you tell her the mean or the median house price? Housing prices [in Austin, at least -- think of all those Dellionaires] are skewed to the right. Unless your friend is rich, the median housing price would be more useful than the mean housing price [which would be larger than the median, thanks to the Dellionaires' expensive houses].
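The effect is easy to see with a small made-up list of prices. The figures below are hypothetical [they are not real Austin data]; the point is only that a single very expensive house pulls the mean far above the median.

```python
from statistics import mean, median

# Hypothetical house prices in thousands of dollars (illustrative only).
prices = [180, 195, 210, 220, 240, 260, 300, 2500]

print(mean(prices))    # pulled upward by the one very expensive house
print(median(prices))  # the middle of the sorted prices
```

Here the mean is more than twice the median, even though seven of the eight houses cost 300 or less.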


In fact, many distributions that occur in practical situations are skewed, not symmetric. [For some examples, see the Life is Lognormal! website.]

Implications for Applying Statistical Techniques

How do we work with skewed distributions when so many statistical techniques give information about the mean? First, note that most of these techniques assume that the random variable in question has a distribution that is normal. Many of these techniques are somewhat "robust" to departures from normality -- that is, they still give pretty accurate results if the random variable has a distribution that is not too far from normal. But many common statistical techniques are not valid for strongly skewed distributions. Two possible alternatives are:

I. Taking logarithms of the original variable.

Fortunately, many of the skewed random variables that arise in applications are lognormal. That means that the logarithm of the random variable is normal, and hence most common statistical techniques can be applied to the logarithm of the original variable. [With robust techniques, approximately lognormal distributions can also be handled by taking logarithms.] However, doing this may require some care in interpretation. There are three common routes to interpretation when dealing with logs of variables.
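A small simulation can make this concrete. The sketch below draws a simulated lognormal sample [the seed, sample size, and parameters are arbitrary assumptions] and compares the gap between mean and median before and after taking logarithms.

```python
import math
import random
from statistics import mean, median

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated lognormal data: the log of each value is normal(0, 1).
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
logs = [math.log(x) for x in raw]

# For right-skewed data the mean sits well above the median;
# after the log transform the two should nearly coincide.
raw_gap = mean(raw) - median(raw)
log_gap = mean(logs) - median(logs)
print(raw_gap, log_gap)
```

The raw gap is clearly positive, while the gap for the logged values is close to zero, which is what one would expect if the logged values are approximately normal.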

1. In many fields, it is common to work with the log of the original outcome variable, rather than the original variable. Thus one might do a hypothesis test for equality of the means of the logs of the variables. A difference in the means of the logs will tell you that the original distributions are different, which in some applications may answer the question of interest. 

Many measurement variables evaluated in public health are more or less normally distributed, but some are not. Consider the frequency distribution showing the percentage of people who have 0 to 20 or more drinks per day as shown below.

This distribution is not normal; it is skewed to the right. In this situation, the mean and standard deviation are not appropriate parameters for characterizing this variable. Instead, one should use a median to characterize the central tendency and interquartile range to indicate variability.

The Median

The median is the middle value, i.e., the value at which half of the measurements are below that value and half are above. For the seven systolic blood pressure measurements in the table above, the median value is 121 mm Hg.

100   110   114   121   130   130   160

To find the median, sort the values and take the middle value if the number of values is odd; if the number of values is even, the median is the average of the two middle values. However, it is easier to let Excel compute this, particularly with a large number of subjects.
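For readers working outside Excel, the same calculation can be sketched in Python with the standard-library statistics module. The seven systolic values are the ones listed above; the even-length list is a hypothetical example added to show the averaging rule.

```python
from statistics import median

# The seven systolic blood pressure values (mm Hg), deliberately unsorted.
systolic = [121, 110, 114, 100, 160, 130, 130]
print(sorted(systolic))
print(median(systolic))  # odd count: the middle value, 121

# Hypothetical even-length list: the median is the average of 110 and 114.
even_example = [100, 110, 114, 121]
print(median(even_example))  # 112.0
```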

Interquartile Range [IQR]

The figure below has a normal distribution for which we would use mean and standard deviation instead of median and IQR. However, for ease of illustration we will use this normal distribution to explain quartiles and interquartile range that would be used for a skewed distribution. [Note that the mean and median will be similar in a symmetrical distribution like this.]

To figure out the quartiles and IQR manually, you would first rank the observations from smallest to greatest and then divide the data set into four equal parts. These are the quartiles, each of which has an equal [or nearly equal] number of observations. Half of the observations will be below the median, and half will be above.

The 1st quartile [Q1] contains the lowest 25% of observations; its upper boundary is the middle value between the lowest value and the median. The 4th quartile [Q4] contains the highest 25% of observations; its lower boundary is the middle value between the median and the highest value in the data set. The 2nd quartile [Q2] contains the 25% between the 1st quartile and the median, and the 3rd quartile [Q3] contains the 25% between the median and the 4th quartile. The interquartile range [IQR] is the range of the middle 50% of the data, i.e., the distance between the top of Q1 and the top of Q3.

IQR=Q3-Q1
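As a sketch of this calculation, the example below applies Python's statistics.quantiles to the seven systolic blood pressure values used earlier. Software packages use slightly different conventions for locating quartiles, so results can differ a little between tools; the default "exclusive" method used here happens to agree with the manual middle-value approach for these seven values.

```python
from statistics import quantiles

systolic = [100, 110, 114, 121, 130, 130, 160]

# n=4 returns the three cut points: top of Q1, the median, top of Q3.
q1, med, q3 = quantiles(systolic, n=4)
iqr = q3 - q1

print(q1, med, q3)  # 110.0 121.0 130.0
print(iqr)          # 20.0
```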

Outliers

Outliers are extreme values. Data points are outliers if they meet either of these definitions:

  • For a more or less normal distribution, outliers are values more than 3 standard deviations above or below the mean.
  • For non-normal distributions, outliers are:
    • values > Q3 + 1.5[IQR]
    • values < Q1 - 1.5[IQR]
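The IQR-based rule can be sketched as follows, assuming the usual pair of fences: Q3 + 1.5[IQR] above and Q1 - 1.5[IQR] below. The function name iqr_outliers and the added value of 250 are illustrative assumptions, and the quartiles again use the default convention of Python's statistics.quantiles.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return the values lying outside Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# The seven systolic values plus one hypothetical extreme reading of 250.
readings = [100, 110, 114, 121, 130, 130, 160, 250]
print(iqr_outliers(readings))  # [250]
```

For the original seven values alone, the fences are 80 and 160, so no reading is flagged: 160 lies on the upper fence rather than beyond it.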
