A blog by Rob J Hyndman 

Transforming data with zeros

Published on 13 August 2010

I’m currently working with a hydrologist, and he raised a question that comes up frequently with real data: what do you do when the data look like they need a log transformation, but there are zero values?

I asked the question on stats.stackexchange.com and received some useful suggestions. What follows is a summary based on these answers, my own experience, and a few papers I discovered that deal with the topic. In general, the most appropriate course of action depends on the model and the context. Zeros can arise for several different reasons, each of which may have to be treated differently.

Box-Cox (BC) transformations

There is a two-parameter version of the Box-Cox transformation that allows a shift before transformation:

    \[ g(y;\lambda_{1}, \lambda_{2}) = \begin{cases} \frac{(y+\lambda_{2})^{\lambda_{1}} - 1}{\lambda_{1}} & \text{when } \lambda_{1} \neq 0 \\ \log (y + \lambda_{2}) & \text{when } \lambda_{1} = 0 \end{cases}. \]

The usual Box-Cox transformation sets \(\lambda_2=0\). One common choice with the two-parameter version is \(\lambda_1=0\) and \(\lambda_2=1\), which has the neat property of mapping zero to zero. There is even an R function for this: log1p(). More generally, both parameters can be estimated. In R, the boxcox.fit() function in package geoR will fit the parameters.
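As a rough illustration of the formula above, here is a minimal sketch in Python (standard library only, rather than R). The function name box_cox2 and the sample values are made up for this example; it is not the geoR implementation and does no parameter estimation.

```python
import math

def box_cox2(y, lam1, lam2=0.0):
    """Two-parameter Box-Cox transform of a single value y.

    Hypothetical helper for illustration: shifts y by lam2, then applies
    the power transform with exponent lam1 (log when lam1 == 0).
    """
    shifted = y + lam2
    # The log and negative powers need a strictly positive argument;
    # positive powers also accept a shifted value of exactly zero.
    if shifted < 0 or (shifted == 0 and lam1 <= 0):
        raise ValueError("y + lam2 is outside the domain of the transform")
    if lam1 == 0:
        return math.log(shifted)
    return (shifted ** lam1 - 1.0) / lam1

# Made-up data containing a zero.
flows = [0.0, 0.3, 1.2, 5.0, 40.0]

# lam1 = 0, lam2 = 1 is log(1 + y), so zero maps to zero.
print([box_cox2(y, 0, 1) for y in flows])

# With lam2 = 0 the plain log fails at zero.
try:
    box_cox2(0.0, 0, 0)
except ValueError as err:
    print("plain log fails at zero:", err)
```

With lam1 = 0 and lam2 = 1 the result should agree with math.log1p, which is the Python counterpart of R's log1p().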

Alternatively, when \(\lambda_1=0\), it has been suggested that \(\lambda_2\) should be approximately one half of the smallest non-zero value. Another suggestion is that \(\lambda_2\) should be the square of the first quartile divided by the third quartile (Stahel, 2002).
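The two rules of thumb for the shift can be sketched as follows, again in Python with only the standard library. Several details here are assumptions rather than anything stated above: the function name shift_candidates is hypothetical, the quartile rule is read as q1^2 / q3, the quartiles are taken over the positive values only (so the shift cannot collapse to zero when many observations are zero), and the "inclusive" quantile method is just one of several reasonable quartile definitions.

```python
import statistics

def shift_candidates(values):
    """Return two candidate shifts (lambda_2) for log(y + lambda_2).

    Hypothetical helper: the first is half the smallest non-zero value;
    the second is q1**2 / q3, with quartiles of the positive values
    (an assumption of this sketch).
    """
    positives = sorted(v for v in values if v > 0)
    if len(positives) < 2:
        raise ValueError("need at least two positive values")
    half_min = positives[0] / 2.0
    q1, _median, q3 = statistics.quantiles(positives, n=4, method="inclusive")
    quartile_rule = q1 ** 2 / q3
    return half_min, quartile_rule

# Made-up data with two zeros.
data = [0, 0, 1, 2, 3, 4, 5]
half_min, quartile_rule = shift_candidates(data)
print("half of smallest non-zero value:", half_min)
print("q1^2 / q3:", quartile_rule)
```

The two rules can give quite different shifts on the same data, so it is worth checking how sensitive the downstream model is to the choice.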

I’ve used functions like this several times, including in Hyndman & Grunwald (2000), where we applied \(\log(y+\lambda_2)\) to daily rainfall data.

One simple special case is the square root, where \(\lambda_2=0\) and \(\lambda_1=0.5\). This works fine with zeros (although not with negative values). However, the square root is often not a strong enough transformation to deal with the high levels of skewness seen in real data.

Inverse hyperbolic sine (IHS) transformation

An alternative transformation family was proposed by Johnson (1949) and is defined by

    \[ f(y,\theta) = \text{sinh}^{-1}(\theta y)/\theta = \log(\theta y + (\theta^2y^2+1)^{1/2})/\theta, \]

where \(\theta>0\). For any value of \(\theta\), zero maps to zero. There is also a two-parameter version allowing a shift, just as with the two-parameter BC transformation. Burbidge, Magee and Robb (1988) also discuss the IHS transformation, including estimation of \(\theta\).

The IHS transformation works with data defined on the whole real line, including negative values and zeros. For large values of \(y\) it behaves like a log transformation for any \(\theta>0\), since \(f(y,\theta)\approx\log(2\theta y)/\theta\). As \(\theta\rightarrow0\), \(f(y,\theta)\rightarrow y\).
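The properties listed above are easy to check numerically. This is a minimal sketch in Python using math.asinh from the standard library; the function name ihs and the test values are chosen only for illustration, and no estimation of theta is attempted.

```python
import math

def ihs(y, theta):
    """Inverse hyperbolic sine transform sinh^{-1}(theta * y) / theta."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return math.asinh(theta * y) / theta

theta = 0.7  # arbitrary illustrative value

# Zero maps to zero, and negative values are handled symmetrically.
print(ihs(0.0, theta), ihs(-2.5, theta), ihs(2.5, theta))

# The asinh form agrees with the explicit log form of the definition.
y = 2.5
log_form = math.log(theta * y + math.sqrt(theta**2 * y**2 + 1)) / theta
print(abs(ihs(y, theta) - log_form))

# For large y the transform is close to log(2 * theta * y) / theta.
big = 1e6
print(ihs(big, theta), math.log(2 * theta * big) / theta)

# For very small theta the transform is close to the identity.
print(ihs(3.0, 1e-6))
```

Because the large-y behaviour is log(2 * theta * y) / theta, theta acts as a scale: it controls where the transform switches from roughly linear near zero to roughly logarithmic in the tails.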

Mixture models

For continuous data, there can be a discrete spike at zero associated with the sensitivity of the measurements. For example, in wind energy, wind below 2 m/s is often recorded as zero, and the distribution of wind energy produced is continuous with a spike at zero.

With rainfall data, there is a spike at zero for a different reason: it didn’t rain. These are genuine zeros (rather than undetectably small values).

With insurance data, a similar phenomenon occurs: the distribution of claims is continuous with a large spike at zero.

A fourth example might be income data: zero if someone is not in paid work, but a continuous positive value otherwise.

In each of these cases, a mixture model is probably the most appropriate approach: one part of the model determines the probability of a zero, and the other part determines the distribution of the data when they are positive. We also used something like this in Hyndman and Grunwald (2000).
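To make the two-part idea concrete, here is a minimal sketch in Python (standard library only). It is not the model from Hyndman and Grunwald (2000): the log-normal distribution for the positive part is purely an assumption of this example, there are no covariates, and the names fit_two_part and mixture_mean are hypothetical.

```python
import math
import statistics

def fit_two_part(values):
    """Fit a simple two-part model to non-negative data.

    Part 1: the probability of an exact zero (sample proportion).
    Part 2: a log-normal distribution for the positive values
    (an assumption of this sketch, not something the data guarantee).
    """
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    positives = [v for v in values if v > 0]
    if len(positives) < 2:
        raise ValueError("need at least two positive values")
    logs = [math.log(v) for v in positives]
    return {
        "p_zero": (len(values) - len(positives)) / len(values),
        "log_mean": statistics.fmean(logs),
        "log_sd": statistics.stdev(logs),
    }

def mixture_mean(fit):
    """Mean of the mixture: P(positive) times the log-normal mean."""
    lognormal_mean = math.exp(fit["log_mean"] + fit["log_sd"] ** 2 / 2)
    return (1 - fit["p_zero"]) * lognormal_mean

# Made-up daily rainfall amounts (mm), with dry days recorded as zero.
rain = [0.0, 0.0, 0.0, 1.2, 0.0, 7.5, 0.4, 0.0, 22.0, 3.1]
fit = fit_two_part(rain)
print(fit)
print("implied mean:", mixture_mean(fit))
```

The zeros never need transforming here: they inform only the probability of a zero, while the log is applied only to the strictly positive values.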


Comments
  • http://johnramey.net John Ramey

    Have you considered the Yeo-Johnson power transformation? It was published in Biometrika in 2000, and the R package VGAM includes a yeo.johnson function. It is also an extension of Box-Cox.

    It was designed for the special cases that you have mentioned.

    Do you have any suggestions on multivariate approaches to the same issue without resorting to a series of univariate transformations?

    Thanks.

    • http://robjhyndman.com Rob J Hyndman

      I didn’t know about that family of transformations. Thanks for the tip.

      I’ve never considered the problem for multivariate data.