Friday 8 March 2013

Exponential Distributions, Part 2 - The Tail Wags the Dog

In Part 1, we looked at the asymmetry of exponential distributions, particularly Erlang distributions, and concluded that in terms of randomness, bad luck comes more frequently than good luck.  Now in Part 2, we look at the magnitude of that bad luck.  It's not just the number of bad-luck events that matters, but how large those individual events can be and how much effect they can have on the behaviour of a system.

Exponential Attribute #2:  Outliers Cannot Be Ignored
Or, The Tail Wags the Dog


What is an outlier?

An outlier is a statistical term for an observation that is markedly different from the other observations in the dataset.  The definition is ultimately subjective, even though outliers are often identified using statistical formulas.  Common practice uses +/- 3 standard deviations from the mean as the boundary for outliers, which is reasonable for normal distributions.  However, there is no mathematical rule that says this boundary definition is better than any other.
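As a concrete illustration, here is a minimal sketch of that 3-standard-deviation rule in Python (the readings are hypothetical, invented for this example):

    import numpy as np

    rng = np.random.default_rng(0)

    def three_sigma_outliers(data):
        """Return the values more than 3 standard deviations from the sample mean."""
        data = np.asarray(data, dtype=float)
        mean, sd = data.mean(), data.std(ddof=1)
        return data[np.abs(data - mean) > 3 * sd]

    # Hypothetical example: 1,000 well-behaved readings plus one wild measurement.
    readings = np.append(rng.normal(loc=100, scale=2, size=1000), 130.0)
    print(three_sigma_outliers(readings))  # flags the 130 -- and perhaps a few genuine readings too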

So what is the purpose of identifying outliers?

The primary purpose is to exclude measurement errors or other unusual occurrences that would bias the dataset and lead one to a wrong conclusion.  That's a worthwhile goal.  However, the opposite danger is excluding valid measurements as outliers, which can just as easily lead to a wrong conclusion.  Excluding outliers is a double-edged sword.

The chart on the right is a box plot of Michelson's classic speed-of-light measurements from 1879 (the morley dataset in R).  There were five experiments with 20 observations each.  The top line of each box shows the 75th percentile value, the bottom line the 25th percentile, and the middle bold line the median; the whiskers (the T's) extend to the most extreme values not flagged as outliers.  The small circles represent the outliers, identified by the standard box-plot rule: any point more than 1.5 times the interquartile range beyond the nearer quartile.  For experiment #3, four of the observations are deemed to be outliers -- two large-value outliers and two small-value outliers.  As you can see, while statistically they may be considered distant from the rest of the values in their group, three of them are within the normal "inlier" ranges of the other four experiments.
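For the curious, here is a minimal sketch of that box-plot rule in Python.  The readings below are hypothetical stand-ins in the same style as the Michelson data (km/s minus 299,000), invented for this example:

    import numpy as np

    def boxplot_outliers(data):
        """Flag points beyond 1.5 interquartile ranges of the quartiles --
        the usual box-plot convention."""
        data = np.asarray(data, dtype=float)
        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return data[(data < low) | (data > high)]

    # Hypothetical speed-of-light-style readings (km/s minus 299,000).
    readings = [880, 880, 860, 720, 620, 860, 970, 950, 880, 910,
                850, 870, 840, 840, 850, 840, 840, 840, 890, 810]
    print(boxplot_outliers(readings))  # [720. 620. 970. 950.]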

This is a good example of the difficulty and arbitrary nature of defining an outlier.  When you realize that ALL of the variation in this experiment is due solely to measurement error -- the speed of light is not changing -- it raises the question of why some measurement errors would be accepted as inliers while other errors would be deemed outliers.  How does one know if the two low-value outliers in experiment #3 are closer to the true value than the other 18 higher values?  As it turns out, the true speed of light sits at value 792 on this chart (299,000 + 792 = 299,792 km/s).  That means one of the low-value outliers in experiment #3 is just as close to being correct as the 25th percentile value.  It was a valid measurement and should not have been classified as an outlier.

So even though this dataset was roughly normally distributed, it was still difficult to find the true outliers.  With exponential distributions, it gets even more difficult because of the long tail.  Simply going out 3 standard deviations from the mean does not give you a reasonable boundary for identifying outliers.

For example, using an outlier boundary of +/- 3 standard deviations on a normal distribution would define 0.27% of the events to be outliers, or 1 out of every 370 events.  That means if you were measuring the outdoor temperature once per day, you would expect to see either a high or low outlier about once per year.  If, however, the temperature happened to follow an Erlang distribution (with k=3, lambda=6), the 3-standard-deviation rule would define 1.2% of the events to be outliers -- all of them on the high side, since the lower boundary (mean minus 3 standard deviations) falls below zero.  That means you would expect to see an outlier temperature about once per quarter.  That's not particularly infrequent.  (Yes, I know temperature doesn't follow an Erlang distribution, but humour me for a minute!)  And Erlang distributions do not have the heaviest tails, either -- other distributions such as the log-normal or Weibull can have much heavier tails and therefore much greater proportions of their events beyond the 3-sigma boundary.
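Those percentages are easy to check with a few lines of scipy -- an Erlang with shape k and rate lambda is just a gamma distribution with shape k and scale 1/lambda:

    from scipy import stats

    # Erlang(k=3, lambda=6) expressed as a gamma distribution.
    erlang = stats.gamma(a=3, scale=1/6)
    mean, sd = erlang.mean(), erlang.std()  # mean = k/lambda = 0.5, sd = sqrt(k)/lambda ~ 0.289

    # Fraction of events beyond 3 standard deviations above the mean
    # (mean - 3*sd is negative, so there is no low-side boundary to cross).
    print(erlang.sf(mean + 3 * sd))  # ~0.012, i.e. about 1.2% of events
    print(2 * stats.norm.sf(3))      # ~0.0027 for a normal: 1 in every ~370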

It's not just the number of outliers that's important to understand, but the size of each outlier as well.  The heavier the tail, the larger the outlier values can be.  In a normal distribution, the rare outlier events will still have values fairly close to the outlier boundary.  In fact, an event beyond +/- 6 standard deviations from the mean essentially never occurs in a normal distribution.  Not so with exponential distributions.  The tails go on and on, and some outliers can have absolutely huge values.  Using our Erlang parameters above, we would expect 0.016% of events to fall beyond 6 standard deviations above the mean, or 16 of every 100,000 events.  That's not frequent, but it's a far cry from never!  To use our weather analogy one last time (I promise!), that's an incredibly extreme temperature once every 17 years.  Or it's like a "storm of the century" that happens 5 or 6 times a century.
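The same sketch extends to the 6-sigma boundary, and a small simulation (with an arbitrary seed, just for illustration) shows how far past the boundary individual events can land:

    import numpy as np
    from scipy import stats

    erlang = stats.gamma(a=3, scale=1/6)  # Erlang(k=3, lambda=6) again
    mean, sd = erlang.mean(), erlang.std()

    print(erlang.sf(mean + 6 * sd))  # ~1.6e-4: 16 of every 100,000 events
    print(stats.norm.sf(6))          # ~1e-9: essentially never for a normal

    # Simulate 100,000 events and inspect the extremes.
    rng = np.random.default_rng(42)
    sample = erlang.rvs(size=100_000, random_state=rng)
    extremes = sample[sample > mean + 6 * sd]
    print(len(extremes), extremes.max())  # roughly 16 events, the largest well past the boundary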

What's important to understand is that these very large and rare events are not outliers.  They are valid, real events that are an inherent characteristic of these exponential distributions.  They are not measurement errors or aberrations from what is expected.  They should not be dismissed but rather expected and planned for.

An Economist Discovers the Exponential World of Healthcare

I saw this play out in stark reality on a project a number of years ago.  I was part of a team doing data projections showing how the retiring baby boomers would affect the provincial healthcare system over 25 years.  One analysis involved predicting the demand for inpatient services.  A semi-retired and respected economics statistician on our team built a spreadsheet model for inpatient demand, crunched the numbers, and declared that inpatient days would go down over the coming decades as the baby boomers retired, and that total inpatient costs would also drop.  The team leader was thrilled with the good news, but I was skeptical.  I had read a couple of studies in peer-reviewed journals that reached the exact opposite conclusion, namely that inpatient days would rise, average length of stay would rise, and costs would increase materially.

I asked the statistician for his model and data so I could review his analysis.  The data was fine and his model was sound, except for one minor step -- he ignored all of the data points above 3 standard deviations from the mean.  I asked him his reason for omitting such a significant portion of the dataset, and his reply was, "Well, that's what I always do."  Evidently his entire career had been spent on macro-economic data that was normally distributed, and he had gotten into the habit of eliminating the outliers in every analysis without even thinking about it.  The problem is that one cannot do that with inpatient length-of-stay data, which follows an Erlang distribution.  This economist had waved his "outlier magic wand" and made the sickest patients in the province instantly disappear!  And what do you know?  Hospital costs go down when you make the sickest patients disappear!  Unfortunately, doctors and hospital administrators don't have that magic wand in their pockets; they know the sickest patients have to be treated, often at significant cost to the system.  Eliminating them from the projection made no sense.
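To see how badly that trimming step can bias a projection, here is a minimal simulation sketch.  Every parameter below is hypothetical and chosen only for illustration; real length-of-stay data would need its own fitted shape and scale:

    import numpy as np
    from scipy import stats

    # Hypothetical Erlang-shaped length-of-stay data: shape k=2, mean 5 days.
    rng = np.random.default_rng(7)
    los = stats.gamma(a=2, scale=2.5).rvs(size=100_000, random_state=rng)

    # The economist's step: drop everything above mean + 3 standard deviations.
    cutoff = los.mean() + 3 * los.std()
    kept = los[los <= cutoff]

    print(f"patients dropped: {1 - len(kept) / len(los):.1%}")   # only ~1-2% of patients...
    print(f"bed-days dropped: {1 - kept.sum() / los.sum():.1%}") # ...but several times that share of bed-days
    print(f"mean stay: {los.mean():.2f} days -> {kept.mean():.2f} days after trimming")

The few longest-stay patients carry a disproportionate share of the total bed-days, so trimming them away pulls every downstream cost estimate down with them.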

I redid the analysis using all of the inpatient data, and it agreed with the published studies: inpatient days, average length of stay, and costs would all rise as the baby boomers retired.  It took a lot of convincing before the team leader accepted that my analysis was identical to the economist's -- except that I included the sick people!


When dealing with exponential distributions in the real world, remember that the tail end of the distribution drives the behaviour of the system.  Do not ignore the outliers.  They are likely real events; they will happen more frequently than you would like, they will be larger than you would like, and if you ignore them you will get the wrong answer.

With exponential distributions, the tail wags the dog.

Postscript

After publishing my article this morning, I found a piece by Carl Richards at Motley Fool arguing that outliers in normal distributions shouldn't be ignored either.  Seems to be the theme for the week!


2 comments:

  1. I was under the impression that the boxplot of the Michelson-Morley experiments labels outliers on the basis of being more than 1.5 times the IQR from the 25th and 75th percentiles. At least that's the typical standard for boxplots, which are drawn from rank-based statistics. But the IQR standard for identifying outliers -- like any -- is arbitrary anyway, and your point remains just as valid. And your argument about exponential distributions and outliers is really important.

    Replies
    1. You are absolutely correct -- I got the outlier definition based on the mean mixed up with the outlier definition based on the median (i.e. rank-based stats). Outliers in a box plot are in fact defined as beyond 1.5 times the interquartile range, not 3 standard deviations from the mean.

      Thanks for catching that.
