# Transforming data with zeros

I’m currently working with a hydrologist and he raised a question that occurs quite frequently with real data — what do you do when the data look like they need a log transformation, but there are zero values?

I asked the question on stats.stackexchange.com and received some useful suggestions. What follows is a summary based on those answers, my own experience, and a few papers I discovered that deal with the topic. In general, the most appropriate course of action depends on the model and the context. Zeros can arise for several different reasons, each of which may have to be treated differently.

### Box-Cox (BC) transformations

There is a two-parameter version of the Box-Cox transformation that allows a shift before transformation:
$$g(y;\lambda_{1}, \lambda_{2}) = \begin{cases} \frac{(y+\lambda_{2})^{\lambda_1} - 1}{\lambda_{1}} & \mbox{when } \lambda_{1} \neq 0 \\ \log (y + \lambda_{2}) & \mbox{when } \lambda_{1} = 0 \end{cases}.$$
The usual Box-Cox transformation sets $\lambda_2=0$. One common choice with the two-parameter version is $\lambda_1=0$ and $\lambda_2=1$ which has the neat property of mapping zero to zero. There is even an R function for this: log1p().  More generally, both parameters can be estimated. In R, the boxcox.fit() function in package geoR will fit the parameters.

Alternatively, when $\lambda_1=0$, it has been suggested that $\lambda_2$ should be approximately half of the smallest non-zero value. Another suggestion is that $\lambda_2$ should be the square of the first quartile divided by the third quartile (Stahel, 2002).
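As a quick sketch (in Python rather than R, and with function names chosen here just for illustration), the two-parameter transformation and the two suggested defaults for $\lambda_2$ look like this:

```python
import numpy as np

def boxcox2(y, lam1, lam2):
    """Two-parameter Box-Cox transform g(y; lambda1, lambda2)."""
    y = np.asarray(y, dtype=float)
    if lam1 == 0:
        return np.log(y + lam2)
    return ((y + lam2) ** lam1 - 1) / lam1

def lam2_half_min(y):
    """Suggested lambda2: half of the smallest non-zero value."""
    y = np.asarray(y, dtype=float)
    return 0.5 * y[y > 0].min()

def lam2_stahel(y):
    """Stahel (2002): square of the first quartile divided by the third quartile."""
    q1, q3 = np.percentile(y, [25, 75])
    return q1 ** 2 / q3

rain = np.array([0.0, 0.0, 0.2, 1.5, 3.0, 12.0])
print(boxcox2(rain, 0, 1))  # log(y + 1): zero maps to zero, as with log1p()
```

With $\lambda_1=0$ and $\lambda_2=1$ the zeros stay at zero, matching the log1p() property mentioned above; the two helper functions just encode the heuristics for choosing $\lambda_2$.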

I’ve used functions like this several times including in Hyndman & Grunwald (2000) where we used $\log(y+\lambda_2)$ applied to daily rainfall data.

One simple special case is the square root, obtained with $\lambda_2=0$ and $\lambda_1=0.5$. This works fine with zeros (although not with negative values). However, the square root is often not a strong enough transformation to deal with the high levels of skewness seen in real data.

### Inverse hyperbolic sine (IHS) transformation

An alternative transformation family was proposed by Johnson (1949) and is defined by
$$f(y,\theta) = \sinh^{-1}(\theta y)/\theta = \log\left(\theta y + (\theta^2y^2+1)^{1/2}\right)/\theta,$$
where $\theta > 0$. For any value of $\theta$, zero maps to zero. There is also a two-parameter version allowing a shift, just as with the two-parameter BC transformation. Burbidge, Magee and Robb (1988) also discuss the IHS transformation, including estimation of $\theta$.

The IHS transformation works with data defined on the whole real line, including negative values and zeros. For large values of $y$, it behaves like a log transformation, whatever the value of $\theta$. As $\theta\rightarrow0$, $f(y,\theta)\rightarrow y$.
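A minimal sketch of the IHS transformation (in Python; the function name is just for illustration) shows these properties directly:

```python
import numpy as np

def ihs(y, theta):
    """Inverse hyperbolic sine transform: asinh(theta * y) / theta."""
    y = np.asarray(y, dtype=float)
    return np.arcsinh(theta * y) / theta

y = np.array([-5.0, 0.0, 1.0, 1000.0])
print(ihs(y, 1.0))  # handles negatives and zeros; zero maps to zero
print(ihs(np.array([1000.0]), 1.0) - np.log(2 * 1000.0))  # ~0: log-like for large y
```

For small $\theta$ the transform is close to the identity, so $\theta$ controls how quickly the log-like behaviour kicks in.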

### Mixture models

For continuous data, there can be a discrete spike at zero associated with the limited sensitivity of the measurements. For example, in wind energy, wind speed below 2 m/s is often recorded as zero, so the distribution of wind energy produced is continuous with a spike at zero.

With rainfall data, there is a spike at zero for a different reason: it didn’t rain. These are genuine zeros (rather than undetectably small values).

With insurance data, a similar phenomenon occurs — the distribution of claims is continuous with a large spike at zero.

A fourth example might be income data — zero if someone is not in paid work, but a continuous positive value otherwise.

In each of these cases, a mixture model is probably the most appropriate choice: one part of the model determines the probability of a zero, and the other part determines the distribution of the data when they are positive. We also used something like this in Hyndman and Grunwald (2000).
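As an illustration of the two-part idea (a Python sketch with simulated data, not the actual model from the paper), the two components can be fitted separately:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate daily "rainfall": dry days are genuine zeros; wet-day amounts
# are drawn from a lognormal distribution.
n = 2000
wet = rng.random(n) < 0.4
rain = np.where(wet, rng.lognormal(mean=0.5, sigma=1.0, size=n), 0.0)

# Fit the two parts of the mixture separately.
p_wet = (rain > 0).mean()           # probability of a non-zero observation
logpos = np.log(rain[rain > 0])     # continuous part, modelled on the log scale
mu, sigma = logpos.mean(), logpos.std(ddof=1)

print(f"P(rain > 0) = {p_wet:.2f}, log-scale mean = {mu:.2f}, sd = {sigma:.2f}")
```

The zero part could just as easily be a logistic regression on covariates, and the positive part any suitable continuous model; the point is that the two questions are answered by separate pieces of the model.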

### Comments

• Have you considered the Yeo-Johnson power transformation? It was published in Biometrika in 2000, and the R package VGAM includes a yeo.johnson function. It is an extension of the Box-Cox transformation.

It was designed for the special cases that you have mentioned.

Do you have any suggestions on multivariate approaches to the same issue without resorting to a series of univariate transformations?

Thanks.

• I didn’t know about that family of transformations. Thanks for the tip.

I’ve never considered the problem for multivariate data.

• Might you post the specific URL for the inquiry on Stats StackExchange (I guess I should say “Cross Validated”) that you mentioned? I use that site, and would like to check if there have been any additional responses in the 2 year interim. Thank you for considering my request!

• Thank you, Rob!

• A.F.M. Kamal Chowdhury

Hi Rob,

Thank you for this very useful post. I have been working with 60 years of daily rainfall data which contain many zero values. I have tried the boxcox.fit() function in package geoR, as you mentioned above, to transform my data, but obtained an error message: “Error: could not find function "boxcox.fit"”. Can you please help me in this regard?

• It sounds like you haven’t loaded the package. Use library(geoR).

• AFM Kamal Chowdhury

Thank you for your reply. It seems the boxcox.fit() function has been changed to boxcoxfit(). Even with the latter, I get an error message as below:

“> boxcoxfit(NARCliM_rainfall_ts)

Error in boxcoxfit(NARCliM_rainfall_ts) :
Transformation requires positive data”

As I have mentioned earlier, I have zero values in my data.

AFM Kamal Chowdhury
MPhil Student
University of Newcastle, Australia

• Elaine Oon

Hi Rob

I found your post on the Box-Cox transformation above very helpful. I have a query on which I would like to seek your advice: I am trying to do a Box-Cox transformation with a shift. I have a dependent variable, annual foreign sales of companies (in US$ thousands), which contains zeros, for a set of panel data. I have been advised to add a small amount, for example 0.00001, to the annual foreign sales figures so that I can take the log, but I think a Box-Cox transformation will produce a more appropriate constant than 0.00001. I have done a Box-Cox transformation in R with the code below, but it has given me a very large lambda2 of 31162.8.

library(geoR)
boxcoxfit(bornp$ForeignSales, lambda2 = TRUE)
# R output – Fitted parameters:
#        lambda       lambda2          beta       sigmasq
# -1.023463e+00  3.116280e+04  9.770577e-01  7.140328e-11

My hunch is that the above value of lambda2 is very large, so I am not sure if I need to run boxcoxfit with my independent variables, like below:
boxcoxfit(bornp$ForeignSales, cbind(bornp$family, bornp$roa, bornp$solvencyratio),
lambda2 = TRUE)

I am still trying to identify the best set of independent variables, so I am not sure if using the boxcoxfit with independent variables at this stage will work or is best.

Thanks,
Elaine

• Carlos Eduardo

Hi. My model has two independent variables. One of them has zero as a valid value. Is it sound practice to split the dataset in two: one where both independent variables have non-zero values, and another where that independent variable is zero? Then perform a regression on each dataset, so the resulting model comprises two different equations. Are there any issues I should be aware of with this approach? Does it have a name? Thanks in advance.

• Daniel

Hi Rob,

Very useful post, thanks. Do you know if there is any discussion of the practical consequences of the various choices e.g. in terms of bias/variance?

A small comment: the last header should probably read “mixture models”. “Mixed models” means random effects to me rather than zero inflation.

Best, Daniel

P.S. I’ve never commented before but I really enjoy your posts.

• K Denny

Finding a transformation that permits zeros seems to me to be a “mindless” way of handling the issue. Mixture models at least try to model why there might be a bunch of zeros. In econometrics where you are often dealing with zeros, like income, the norm is to use a censored normal/Tobit or a Heckman selection model or one of their generalisations. For count data, a hurdle model makes sense or a zero-inflated model.

• Guest

Thanks for this post and for your related comments on Cross Validated, Rob! I gave the inverse sinh transformation a try on some continuous zero-inflated data that I have, and it produced a nice distribution except for the zero inflation. On the other hand, a log(y+alpha) transform is a little rougher, but handles the zero inflation fairly nicely.

• AngeloDAmbrosio

Hi! William Gould, president of StataCorp (and some others), says that when you have a non-negative dependent variable it is OK to use Poisson regression, even if the generating process is not Poisson and the zeros are not generated by a different process (as in the mixture cases you have shown). The estimates are correct anyway; only the standard errors are too small, but you can use a more robust variance estimator or, as I do, use the bootstrap for the errors and avoid the problem altogether. Negative binomial and quasi-Poisson are also good alternatives for the estimation of errors.
http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/


• Lino Llamosa Mejia

Dear Rob J. Hyndman

Thank you for your contribution in this field. What happens when you have the following equation in logs: ln(Y) = ln(T) + ln(M) + ln(K), in which Y represents production, T the number of workers, M intermediate materials and K capital, for a dataset of 3630 observations? The problem is that I have 400 observations with zero values for the number of workers. Can I apply the Box-Cox transformation to only one of those three variables? That is, transform only T with Box-Cox and keep the natural logarithm for the rest?

Thank you for your valuable comment.

• Dan Lewer

Thanks so much for this. It’s really helpful. I’m sorry to ask a basic question – but do you know how you interpret the intercept and coefficients if you use the log(y + 1) for the independent variable in linear regression? I’ve been searching around the internet and am struggling a bit…

• No. If y is large, the interpretation is approximately the same as for logs, but otherwise I don’t think the parameters have a neat interpretation.
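To see why, here is a quick numerical check (a Python sketch) of how close $\log(y+1)$ is to $\log y$:

```python
import numpy as np

y = np.array([0.5, 2.0, 50.0, 5000.0])
gap = np.log1p(y) - np.log(y)  # log(y + 1) - log(y) = log(1 + 1/y)
print(gap)  # shrinks towards zero as y grows
```

So for large y the coefficients can be read roughly as under a plain log transformation, but for small y the gap is substantial and no simple reading applies.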