# Winsorization

Winsorization replaces extreme data values with less extreme values.

## But why

Extreme values sometimes have a big effect on statistical operations.  That effect is not necessarily a good effect.  One approach to the problem is to change the statistical operation — this is the field of robust statistics.

An alternative solution is to just change the data.  You can then use whatever statistical procedure you want.

In my experience in finance only mildly robust statistics (and hence only mildly winsorized data) are called for.  There seems to be a surprising amount of information in the tails of financial returns.

## Trimming

There is an alternative to winsorization, which is just throwing out the extreme values.  That is called “trimming”.  The mean function in R has a trim argument so that you can easily get trimmed means:

> mean(c(1:10, 300))
[1] 32.27273
> mean(c(1:10, 300), trim=.05)
[1] 32.27273
> mean(c(1:10, 300), trim=.1)
[1] 6

Trimming removes a certain fraction of the data from each tail.
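
To see how large that fraction is in practice, here is a minimal sketch of a trimmed mean done by hand. It assumes that the number of points dropped from each end is `floor(n * trim)`, which is consistent with the results shown above. The helper name `trimmed_mean_by_hand` is made up for this illustration and is not part of base R.

```r
# Hypothetical helper (not part of base R): a hand-rolled trimmed mean.
# Assumes 0 <= trim < 0.5 and no missing values.
trimmed_mean_by_hand <- function(x, trim = 0) {
    n <- length(x)
    k <- floor(n * trim)        # number of points dropped from EACH tail
    xs <- sort(x)
    mean(xs[(k + 1):(n - k)])   # average what is left in the middle
}

x <- c(1:10, 300)
trimmed_mean_by_hand(x, 0.05)   # floor(11 * 0.05) is 0: nothing is dropped
trimmed_mean_by_hand(x, 0.1)    # floor(11 * 0.1) is 1: 1 and 300 are dropped
```

With 11 observations a trim of .05 removes no data at all, which is why the second call to `mean` above returns the untrimmed answer.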

## Winsorizing — one way

One approach to winsorization is just to copy trimming, but replace the extreme values rather than throw them out.  Here is an R function that does this:

> winsor1
function (x, fraction=.05)
{
   # fraction is the proportion of data to pull in from each tail
   if(length(fraction) != 1 || fraction < 0 ||
         fraction > 0.5) {
      stop("bad value for 'fraction'")
   }
   # lim[1] is the lower cutoff, lim[2] is the upper cutoff
   lim <- quantile(x, probs=c(fraction, 1-fraction))
   x[ x < lim[1] ] <- lim[1]
   x[ x > lim[2] ] <- lim[2]
   x
}
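
As a small worked example, the steps inside `winsor1` can be run by hand on the vector used in the trimming section. This is only a sketch: it assumes the two cutoffs are the `fraction` and `1 - fraction` sample quantiles as computed by R's default `quantile` method, and the variable name `xw` is introduced here purely for illustration.

```r
# The winsor1 logic written out step by step on a small vector.
x <- c(1:10, 300)
fraction <- 0.1

# Lower and upper cutoffs: the 10% and 90% sample quantiles.
lim <- quantile(x, probs = c(fraction, 1 - fraction))

# Pull each tail in to its cutoff instead of discarding it.
xw <- x
xw[xw < lim[1]] <- lim[1]
xw[xw > lim[2]] <- lim[2]

length(xw)   # still 11 values: nothing has been thrown away
xw           # the smallest and largest values have been replaced
mean(xw)     # compare with mean(x, trim = 0.1)
```

Unlike trimming, the result has the same length as the input, so it can be passed on to any procedure that expects the full data.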

Figures 1 and 2 show this function in action.

Figure 1: The winsor1 function with some normally distributed data.

Figure 2: The winsor1 function with some Cauchy distributed data.

## Winsorizing — another way

Another approach to winsorization is to try to just move the datapoints that are likely to be troublesome.  That is, only move data that are too far from the rest.  Here is such an R function:

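> winsor2
function (x, multiple=3)
{
   if(length(multiple) != 1 || multiple <= 0) {
      stop("bad value for 'multiple'")
   }
   # center the data on the median
   med <- median(x)
   y <- x - med
   # sc is the farthest a point may lie from the median
   sc <- mad(y, center=0) * multiple
   y[ y > sc ] <- sc
   y[ y < -sc ] <- -sc
   y + med
}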
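
The same steps can be traced by hand on a small vector. This is a sketch rather than a definitive account: it relies on the default behavior of `mad`, which scales the median absolute deviation by the constant 1.4826 so that it estimates the standard deviation for normal data. The variable name `xw` is introduced only for illustration.

```r
# The winsor2 logic written out step by step on a small vector.
x <- c(1:10, 300)
multiple <- 3

med <- median(x)                      # 6
y <- x - med                          # deviations from the median

# median(abs(y)) is 3 here, so mad() gives 1.4826 * 3 and sc is about 13.34
sc <- mad(y, center = 0) * multiple

# Move only the points that lie more than sc away from the median.
y[y > sc] <- sc
y[y < -sc] <- -sc
xw <- y + med

xw   # 1 through 10 are untouched; 300 is pulled in to med + sc
```

Only the value 300 lies more than `sc` from the median, so it alone is moved. A quantile-based rule with a fraction of .1 would have moved the smallest value as well.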

Figures 3 and 4 show the results of this function using the same data as in Figures 1 and 2.

Figure 3: The winsor2 function with some normally distributed data.

Figure 4: The winsor2 function with some Cauchy distributed data.

I think the second form of winsorization usually makes more sense.  In the examples the normal data are not changed at all by the second method and the Cauchy data look to be changed in a more logical way.

Production quality implementations of the R functions would probably include an na.rm argument to deal with missing values.
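
One possible shape for such an argument is sketched below. The function name `winsor2_na` and the choice to leave missing values in place (rather than drop them from the result) are assumptions made for this illustration, not a description of any existing implementation.

```r
# Hypothetical variant of winsor2 with an na.rm argument.
# Missing values are ignored when computing the center and scale,
# but are kept in their original positions in the result.
winsor2_na <- function(x, multiple = 3, na.rm = FALSE) {
    if (length(multiple) != 1 || multiple <= 0) {
        stop("bad value for 'multiple'")
    }
    med <- median(x, na.rm = na.rm)
    y <- x - med
    sc <- mad(y, center = 0, na.rm = na.rm) * multiple
    # which() drops NA comparisons, so missing values are never indexed
    y[which(y > sc)] <- sc
    y[which(y < -sc)] <- -sc
    y + med
}

x <- c(1:10, NA, 300)
winsor2_na(x)                 # all NA, mirroring mean(x) with missing data
winsor2_na(x, na.rm = TRUE)   # NA stays put; 300 is pulled in
```

Returning a vector of the same length keeps the output aligned with the input, which matters when the data are dates or asset returns that need to stay matched to other series.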


### 9 Responses to Winsorization

1. bill_080 says:

#Very little cutoff. 100 data points
x <- rnorm(100)
plot(x, winsor2(x, 3))

#Obvious cutoff. 10000 data points
x <- rnorm(10000)
plot(x, winsor2(x, 3))

#Very little cutoff. 10000 data points
x <- rnorm(10000)
plot(x, winsor2(x, 5))

• Pat says:

Good point that what you want to do may depend on how much data you have.
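
A rough back-of-the-envelope calculation shows why the sample size matters. The sketch below assumes the data really are normal and treats the MAD-based scale as if it were the true standard deviation, so the numbers are approximate. The helper name `tail_prob` is made up for illustration.

```r
# Chance that a standard normal value lies more than k standard
# deviations from the center (both tails combined).
tail_prob <- function(k) 2 * pnorm(-k)

tail_prob(3)            # roughly 0.0027

# Approximate expected number of points that winsor2 would move.
100 * tail_prob(3)      # well under one point
10000 * tail_prob(3)    # a few dozen points
10000 * tail_prob(5)    # far less than one point
```

So with a multiple of 3 the cutoff is barely visible in 100 points but obvious in 10000, while a multiple of 5 rarely touches normal data at either size.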

2. Rob Steele says:

You could get a little more clarity and speed by replacing
```
y[ y > sc ] <- sc
y[ y < -sc ] <- -sc
```
with something like
```
y <- pmin(pmax(y, -sc), sc)
```

• Pat says:

Rob,

Thanks. Clarity is an important consideration.

Have you tried timing the two? My intuition says that the subscripting would be faster. But that should almost surely take a backseat to clarity since the time difference is going to be small in any event.

My take on clarity is the reverse of yours as well. I’d be keen to put it to a vote on which is more clear to the rest of the R world.

• Rob Steele says:

It looks like you’re right about indexing being faster. Regarding clarity, this is an idiom I use all the time so it’s instantly recognizable to me. It’s like a low level design pattern. I find these are especially important in R, given its infernal side.

Thanks!

3. Pat says:

Rob,

Thanks for the timing report.

We still don’t have any more clarity on clarity.

4. safa says:

Hi Pat,
My background is biology so could you please tell me the meaning of the “multiple” option in winsor2 function?
Thanks
Safa

• Pat says:

Safa,

`multiple` is multiplying the computed value of `mad` (a robust estimate of the standard deviation). So when `multiple` is 3, we are pulling values in that are farther than 3 (estimated) standard deviations from the center.

An improvement to the function would be to add a `center` argument that defaulted to the median.
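
A minimal sketch of that improvement might look like the following. The function name `winsor2_center` is hypothetical, and the details (such as measuring the scale around the supplied center) are one reasonable reading of the idea rather than a settled design.

```r
# Hypothetical variant of winsor2 with a 'center' argument that
# defaults to the median of the data.
winsor2_center <- function(x, multiple = 3, center = median(x)) {
    if (length(multiple) != 1 || multiple <= 0) {
        stop("bad value for 'multiple'")
    }
    y <- x - center
    # scale is measured around the chosen center, not the median
    sc <- mad(y, center = 0) * multiple
    y[y > sc] <- sc
    y[y < -sc] <- -sc
    y + center
}

x <- c(1:10, 300)
winsor2_center(x)               # default: same behavior as winsor2
winsor2_center(x, center = 0)   # scale and cutoffs measured around zero
```

A fixed center such as zero can be a natural choice when the data are known to be centered, for example some return series.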

• safa says:

thanks, i get it now