Posts tagged statistics

Transforming data with zeros

I’m cur­rently work­ing with a hydrol­o­gist and he raised a ques­tion that occurs quite fre­quently with real data — what do you do when the data look like they need a log trans­for­ma­tion, but there are zero values?

I asked the question on stats.stackexchange.com and received some useful suggestions. What follows is a summary based on these answers, my own experience, and a few papers I discovered that deal with the topic. In general, the most appropriate course of action depends on the model and the context. Zeros can arise for several different reasons, each of which may have to be treated differently.

Box-Cox (BC) transformations

There is a two-parameter ver­sion of the Box-Cox trans­for­ma­tion that allows a shift before transformation:

g(y;\lambda_{1}, \lambda_{2}) =
\begin{cases}
\frac{(y+\lambda_{2})^{\lambda_{1}} - 1}{\lambda_{1}} & \mbox{when } \lambda_{1} \neq 0 \\
\log (y + \lambda_{2}) & \mbox{when } \lambda_{1} = 0
\end{cases}.

The usual Box-Cox transformation sets \lambda_2=0. One common choice with the two-parameter version is \lambda_1=0 and \lambda_2=1, which has the neat property of mapping zero to zero. There is even an R function for this: log1p(). More generally, both parameters can be estimated. In R, the boxcoxfit() function in package geoR (called boxcox.fit() in older releases) will fit the parameters.
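
As a rough illustration of the two-parameter transformation, here is a minimal Python sketch. The function name box_cox2 and the example values are my own, not taken from any package, and the code assumes y + \lambda_2 > 0.

```python
import math

def box_cox2(y, lam1, lam2=0.0):
    """Two-parameter Box-Cox transform of a single value y.

    Assumes y + lam2 > 0. The name box_cox2 is illustrative only.
    """
    shifted = y + lam2
    if shifted <= 0:
        raise ValueError("y + lam2 must be positive")
    if lam1 == 0:
        return math.log(shifted)
    return (shifted ** lam1 - 1) / lam1

# With lam1 = 0 and lam2 = 1 the transform is log(1 + y), so zero maps to zero.
flows = [0.0, 0.5, 3.0, 120.0]  # made-up example values
transformed = [box_cox2(y, lam1=0, lam2=1) for y in flows]
print(transformed)
```

Setting lam2=0 gives the usual one-parameter transformation, which fails on zeros whenever lam1 is zero or negative.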

Alternatively, when \lambda_1=0, it has been suggested that \lambda_2 should be approximately one half of the smallest non-zero value. Another suggestion is that \lambda_2 should be the square of the first quartile divided by the third quartile (Stahel, 2002).
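
Both rules of thumb are easy to compute. The sketch below is only an illustration: the function names are hypothetical, the data are made up, and I have assumed that the quartiles are taken over the positive values only, which the rule as stated here does not specify.

```python
import statistics

def half_smallest_nonzero(values):
    """Half of the smallest strictly positive observation."""
    return min(v for v in values if v > 0) / 2

def quartile_shift(values):
    """First quartile squared divided by the third quartile.

    Assumption: quartiles are computed from the positive values only.
    """
    positive = sorted(v for v in values if v > 0)
    q1, _, q3 = statistics.quantiles(positive, n=4)
    return q1 ** 2 / q3

rain = [0, 0, 1, 2, 3, 4, 5, 6, 7]  # made-up data
print(half_smallest_nonzero(rain))
print(quartile_shift(rain))
```

Quartile definitions differ between software packages, so the second value can change slightly depending on the method used.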

I’ve used func­tions like this sev­eral times includ­ing in Hyn­d­man & Grun­wald (2000) where we used \log(y+\lambda_2) applied to daily rain­fall data.

One sim­ple spe­cial case is the square root where \lambda_2=0 and \lambda_1=0.5. This works fine with zeros (although not with neg­a­tive val­ues). How­ever, often the square root is not a strong enough trans­for­ma­tion to deal with the high lev­els of skew­ness seen in real data.

Inverse hyper­bolic sine (IHS) transformation

An alter­na­tive trans­for­ma­tion fam­ily was pro­posed by John­son (1949) and is defined by

f(y,\theta) = \text{sinh}^{-1}(\theta y)/\theta = \log(\theta y + (\theta^2y^2+1)^{1/2})/\theta,

where \theta>0. For any value of \theta, zero maps to zero. There is also a two para­me­ter ver­sion allow­ing a shift, just as with the two-parameter BC trans­for­ma­tion. Bur­bidge, Magee and Robb (1988) also dis­cuss the IHS trans­for­ma­tion includ­ing esti­ma­tion of \theta.

The IHS trans­for­ma­tion works with data defined on the whole real line includ­ing neg­a­tive val­ues and zeros. For large val­ues of y it behaves like a log trans­for­ma­tion, regard­less of the value of \theta (except 0). As \theta\rightarrow0, f(y,\theta)\rightarrow y.
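
These properties can be checked numerically. The following is a minimal sketch using Python's math.asinh; the function name ihs is my own label.

```python
import math

def ihs(y, theta):
    """Inverse hyperbolic sine transform, asinh(theta * y) / theta, for theta > 0."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return math.asinh(theta * y) / theta

theta = 1.0
print(ihs(0.0, theta))   # zero maps to zero
print(ihs(-2.0, theta))  # negative values are allowed

# For large y, asinh(theta * y) is close to log(2 * theta * y),
# so the transform tracks a (shifted and scaled) log transformation.
print(ihs(1e6, theta), math.log(2 * theta * 1e6) / theta)

# For very small theta the transform is close to the identity.
print(ihs(3.0, 1e-8))
```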

Mixture models

For continuous data, there can be a discrete spike at zero which can be associated with the sensitivity of the measurements. For example, in wind energy, wind below 2 m/s is often recorded as zero, and the distribution of wind energy produced is continuous with a spike at zero.

With rainfall data, there is a spike at zero for a different reason: it didn’t rain. These are genuine zeros (rather than undetectably small values).

With insur­ance data, a sim­i­lar phe­nom­e­non occurs — the dis­tri­b­u­tion of claims is con­tin­u­ous with a large spike at zero.

A fourth exam­ple might be income data — zero if some­one is not in paid work, but a con­tin­u­ous pos­i­tive value otherwise.

In each of these cases, a mixture model is probably the most appropriate approach, where one part of the model determines the probability of a zero, and the other part determines the distribution of the data when it is positive. We also used something like this in Hyndman and Grunwald (2000).
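
A minimal sketch of such a two-part model is given below. It is not the model from the paper cited above: the log-normal distribution for the positive part is an assumption made purely for illustration, the function names are hypothetical, and the data are made up.

```python
import math
import statistics

def fit_two_part(values):
    """Estimate P(zero) and a log-normal fit to the positive values.

    Assumes non-negative data; the log-normal choice is illustrative only.
    """
    positive = [v for v in values if v > 0]
    p_zero = 1 - len(positive) / len(values)
    logs = [math.log(v) for v in positive]
    return {
        "p_zero": p_zero,
        "log_mean": statistics.mean(logs),
        "log_sd": statistics.stdev(logs),
    }

def expected_value(fit):
    """Mean of the mixture: P(positive) times the log-normal mean."""
    lognormal_mean = math.exp(fit["log_mean"] + fit["log_sd"] ** 2 / 2)
    return (1 - fit["p_zero"]) * lognormal_mean

rain = [0, 0, 1.0, math.e, math.e ** 2]  # made-up daily totals
fit = fit_two_part(rain)
print(fit)
print(expected_value(fit))
```

In practice both parts would usually depend on covariates, for example a logistic regression for the probability of a zero and a regression model for the positive amounts.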

Statistical Analysis StackExchange site now available

The Q&A site for sta­tis­ti­cal analy­sis, data min­ing, data visu­al­iza­tion, and every­thing else to do with data analy­sis has finally been launched. Please head over to

stats.StackExchange.com

and start ask­ing and answer­ing questions.

Also, spread the word to every­one else who may be inter­ested — work col­leagues, stu­dents, etc. The more peo­ple who use the site, the bet­ter it will be. There are already 170 ques­tions, 513 answers and 387 users.

Even­tu­ally the site will move to a dif­fer­ent domain name and have its own logo, etc.  For now it is in “pub­lic beta” which means that it is fully func­tional, but we are still work­ing out some of the details (such as what it will be called, who will be the mod­er­a­tors, etc.).

R ques­tions are allowed on this new site as well as on the orig­i­nal StackOverflow.com. We are still fig­ur­ing out how to avoid the prob­lem of hav­ing answers on two sites. For now, more sta­tis­ti­cal ques­tions should be directed to stats.StackExchange.com and more programming-oriented ques­tions should go to StackOverflow.com.

Stack exchange for statistical analysis needs you!

The pro­posal to cre­ate a Stack­Ex­change site for sta­tis­ti­cal analy­sis is steadily mov­ing for­ward. We have now com­pleted the scop­ing stage which involved find­ing enough peo­ple will­ing to express an inter­est in the idea, and vot­ing on some exam­ple ques­tions to define what is allowed and what is not allowed on the site. The on-topic ques­tions that have been selected are these:

  1. What is a ‘stan­dard deviation’?
  2. Which of the fol­low­ing three graph­ics best dis­plays this data set? Why?
  3. What’s the best way to iden­tify an out­lier in mul­ti­vari­ate data?
  4. Can you give an exam­ple of where I might pre­fer to use a z-test vs a t-test?
  5. What are the dif­fer­ences between Bayesian and Fre­quen­tist reasoning?

Exam­ples of ques­tions con­sid­ered off-topic are:

  1. How do I win in Poker?
  2. I have two chil­dren. One is a boy born on a Tues­day. What is the prob­a­bil­ity I have two boys?
  3. Joe is 8 years old, Mike is 10 years old, and Alice is 13. What is their MEDIAN age?
  4. Where can I access NASA’s data archives?
  5. How much should I expect to pay for a SAS licence?

The next phase is to get peo­ple to com­mit to con­tribut­ing to the site. Many read­ers of this blog have already reg­is­tered as “fol­low­ers” — now you have to make a com­mit­ment to be a con­trib­u­tor as well. The site won’t launch until there are enough peo­ple com­mit­ted to being part of it.

Just go to the site and indi­cate that you are will­ing to be an active par­tic­i­pant once it launches.

If you’re won­der­ing what this is all about, and why this is a much bet­ter approach than the var­i­ous usenet and email help groups, there’s a nice sum­mary on Tal Galili’s blog.

Update on a StackExchange site for statistical analysis

About six weeks ago, I pro­posed that there should be a Stack Exchange site for ques­tions on data analy­sis, sta­tis­tics, data min­ing, machine learn­ing, etc. I can finally report that there has been sub­stan­tial progress on this.

The for­mal pro­posal is now at Area 51 where the scope of the new site is being devel­oped and voted on in a demo­c­ra­tic way. The site has been in a pri­vate beta state for a week or so, but is now open for any­one to join in.

So if you’re inter­ested in this pro­posed site for questions/answers on sta­tis­ti­cal analy­sis, please head over to Area 51 and join in the dis­cus­sion and vot­ing on what the site should cover. It would be a good idea to first read the FAQ so you under­stand how the sys­tem works.

A StackExchange site for statistical analysis?

Reg­u­lar read­ers of this site will know I’m a fan of using Stack Over­flow for ques­tions about LaTeX, R and other areas of pro­gram­ming. Now the peo­ple who pro­duce Stack Over­flow are plan­ning on set­ting up sev­eral new sites for ask­ing ques­tions about other top­ics, and are seek­ing pro­pos­als. I have pro­posed that there should be a site for ques­tions on data analy­sis, sta­tis­tics, data min­ing, machine learn­ing, etc.

It is more likely that my idea will turn into a func­tion­ing site if peo­ple who agree with me vote for it. So if you agree, please head over to meta.stackexchange.com and vote! (You will need to reg­is­ter first, but that’s free.)

If you dis­agree or have any com­ments about the idea, I’d also like to hear from you. But please add your com­ments to my pro­posal rather than here.

Using Google Reader

Google Reader is a fan­tas­tic way to keep track of new papers that are appear­ing in many dif­fer­ent jour­nals, and also to fol­low some of the inter­est­ing research blogs (and blogs on other top­ics) that are out there. Google Reader checks web­sites for you and lets you know of any new mate­r­ial that appears. Instead of you hav­ing to look at dozens of dif­fer­ent web­sites to dis­cover new infor­ma­tion, all you need to do is open up Google Reader and all the infor­ma­tion comes to you. In some ways it is like an email account, but where the mes­sages con­tain new addi­tions to web­sites that you are inter­ested in.

Google Reader is called an “RSS reader” because it reads RSS feeds. RSS stands for “Really Sim­ple Syn­di­ca­tion”. A web­site with an RSS feed makes it pos­si­ble to track addi­tions to the site with­out actu­ally vis­it­ing it your­self.  There are other RSS read­ers, but Google Reader is the most widely used. Recently Google Reader added a facil­ity so that it now also tracks sites that don’t have RSS feeds.

If you haven’t used it before, here’s how to get started.

  1. Go to www.google.com/reader and log in. If you already have a Google account (e.g., you’re a Gmail user), then just use your usual Google details. If you don’t have a Google account, then you will need to set one up.
  2. Click “Add subscription” and type the URL of any website you want to track.
  3. When you are reading a website that you would like to subscribe to, click the orange RSS button. A modern browser such as Firefox or Chrome will figure out that you want to subscribe to the RSS feed. If that doesn’t work, just copy the link address and paste it into the “Add subscription” box in Google Reader.

Each morn­ing I read through any­thing new on Google Reader includ­ing new research papers in jour­nals that I track, new arti­cles on some sta­tis­tics blogs that I fol­low, etc. In fact, I have over 500 sub­scrip­tions! I don’t read every arti­cle or it would take all day, but I do scan the head­lines and read what looks interesting.

It can take a while to col­lect all the sub­scrip­tions for jour­nals you might want to read. To make it easy, you can just piggy-back on my jour­nal col­lec­tion (which cov­ers all sta­tis­tics jour­nals, both fore­cast­ing jour­nals, plus a few econo­met­rics and demog­ra­phy jour­nals, as well as all sta­tis­ti­cal preprints on arxiv). Click here if you want to sub­scribe to all the same jour­nals as me.

If you are inter­ested in R, R-bloggers is very use­ful as it com­bines the posts from a large num­ber of blogs about R.  Just go to the site and click on the RSS feed icon and you will be able to add a sub­scrip­tion to your Google Reader account.

For those who like to keep up with LaTeX, the TeX com­mu­nity aggre­ga­tor does some­thing sim­i­lar for blog­gers writ­ing about LaTeX and related top­ics. Again, just click on the RSS feed icon.

Here is a list of sta­tis­tics research blogs. Check them out and sub­scribe to any­thing that takes your fancy.

This web­site has an RSS feed, as do my other web­sites. Just click the orange but­ton at the top-right of the page and select “Google Reader” and then you will receive any new posts I make in your Google Reader account.

Learning by video

There are some nice online videos avail­able on var­i­ous aspects of sta­tis­tics and math­e­mat­ics that might be help­ful to stu­dents try­ing to learn about new areas.

A search on YouTube will lead to a few fairly basic videos.

A bet­ter place to go is YouTube EDU which con­tains mate­r­ial from universities.

Something similar is offered at iTunesU.

But the best stuff is on Academic Earth, which has excellent full lecture courses from some of the top US universities (Berkeley, Harvard, MIT, Princeton, Stanford, UCLA, Yale).

If any­one knows of some other good sources, please share them in the comments.

More on the evils of statistical tests

Check out the two posts by Galit Shmueli over at Bzst on hypoth­e­sis tests: one on the value of p-values and another on one-sided tests.

She says “Shock­ingly enough, peo­ple seem to really want to use p-values, even if they don’t under­stand them.” That mir­rors my expe­ri­ence too.  Con­fi­dence inter­vals are much more use­ful because they pro­vide a mea­sure of the size of an effect, rather than test­ing if it is equal to some pre­spec­i­fied value.

Why I don’t like statistical tests

It may come as a shock to dis­cover that a sta­tis­ti­cian does not like sta­tis­ti­cal tests. Isn’t that what sta­tis­tics is all about? Unfor­tu­nately, in some dis­ci­plines sta­tis­ti­cal analy­sis does seem to con­sist almost entirely of hypoth­e­sis test­ing, and therein lies the problem.

The stan­dard prac­tice is to con­struct a hypoth­e­sis test to deter­mine if some attribute of the data is “sig­nif­i­cant” or not, with the stan­dard p-value thresh­old of 5%. The analy­sis is per­ceived to be com­pleted when the p-value comes in under 5%. How­ever, any non-trivial hypoth­e­sis will be sig­nif­i­cant if enough data are col­lected. As George Box said, “all mod­els are wrong, but some are use­ful”. So col­lect­ing more data will demon­strate that the pro­posed hypoth­e­sis is wrong, but that doesn’t make it useless.
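
To see how sample size alone drives significance, consider an idealized sketch: a two-sided z-test of a zero mean, where the observed mean is assumed to equal a tiny true effect of 0.01 and the standard deviation is assumed known and equal to 1. The function name and all numbers are made up for illustration.

```python
import math

def two_sided_p(effect, sd, n):
    """Two-sided z-test p-value for H0: mean = 0.

    Assumes the sample mean equals `effect` exactly and that sd is known.
    """
    z = effect * math.sqrt(n) / sd
    return math.erfc(abs(z) / math.sqrt(2))

# The same tiny effect goes from clearly non-significant to
# overwhelmingly significant as n grows.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(effect=0.01, sd=1.0, n=n))
```

The effect size never changes here; only the amount of data does.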

Then there is the common confusion between statistically significant and practically significant. Just because something is significant doesn’t mean it is important. And just because a p-value is larger than 0.05 does not mean the null hypothesis is true. Statisticians learn all this in first year, but still the research literature is riddled with papers that imply otherwise.

The next prob­lem is that p-values are extremely sen­si­tive to collinear­ity. Con­se­quently, to use p-values based on t-tests to deter­mine the sig­nif­i­cance of terms in a regres­sion is silly. Often terms will appear insignif­i­cant, yet they should be included as they improve the pre­dic­tions. Yet this approach is prob­a­bly the most com­mon method for deter­min­ing what vari­ables to include in a regres­sion, even in some stan­dard text­books. The sit­u­a­tion is even worse in autore­gres­sion, where the collinear­ity is often very strong.

Another thing I dis­like about sta­tis­ti­cal tests is the alter­na­tive hypoth­e­sis. This was not orig­i­nally part of hypoth­e­sis test­ing as pro­posed by Fisher. It was intro­duced by Ney­man and Pear­son. Frankly, the alter­na­tive hypoth­e­sis is unnec­es­sary. It is not used in the com­pu­ta­tion of p-values or for deter­min­ing sta­tis­ti­cal sig­nif­i­cance. The only prac­ti­cal use for the alter­na­tive hypoth­e­sis that I can see is in deter­min­ing the power of a test.

Finally, I hate one-sided tests even more than two-sided tests. It’s lit­tle bet­ter than cheat­ing. You claim that a para­me­ter can only pos­si­bly move in one direc­tion, and thereby cut your p-value in half. I sus­pect it is usu­ally done to obtain sig­nif­i­cant results in order to increase the chances of pub­li­ca­tion. In real­ity, can we ever be really sure that a para­me­ter can only be zero or positive?

Now a good sta­tis­ti­cian can avoid all of these errors and use sta­tis­ti­cal tests hon­estly and appro­pri­ately. And I do occa­sion­ally use tests in my papers, hope­fully avoid­ing the above prob­lems. But I strongly pre­fer the pre­dic­tive mod­el­ling approach. That is, if you have two poten­tial mod­els, choose the one that pre­dicts best. Infor­ma­tion cri­te­ria, such as the AIC, are per­fect for this task.
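
As a small illustration of choosing between two models with the AIC, the sketch below compares a constant-mean model with a straight-line model on made-up data. It assumes Gaussian errors, for which the AIC equals n log(RSS/n) + 2k up to an additive constant that is the same for both models; the function name gaussian_aic is hypothetical.

```python
import math

def gaussian_aic(residuals, n_params):
    """AIC for a Gaussian model, up to an additive constant: n*log(RSS/n) + 2k."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params

x = [0, 1, 2, 3, 4, 5]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]  # made-up data

# Model A: constant mean (mean and error variance: 2 parameters).
mean_y = sum(y) / len(y)
resid_a = [yi - mean_y for yi in y]

# Model B: straight line fitted by least squares (3 parameters).
mean_x = sum(x) / len(x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
resid_b = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

aic_a = gaussian_aic(resid_a, n_params=2)
aic_b = gaussian_aic(resid_b, n_params=3)
best = "line" if aic_b < aic_a else "constant"
print(aic_a, aic_b, best)
```

The model with the smaller AIC is preferred; no significance test on the slope is involved.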

In fore­cast­ing, the only place in which I find test­ing use­ful is in deter­min­ing the order of inte­gra­tion of a time series; i.e., choos­ing d in an ARIMA(p,d,q) model. If I could come up with some way of doing this effec­tively with­out using a unit-root test, I would gladly do so. But so far, I have not found a reli­able alternative.

For more on this topic, see my work­ing paper with Andrey Kostenko.

Songs of Statistics

If you love sta­tis­tics (don’t we all?) and can write Chi­nese (which rules me out), you might like to con­tribute to the Chi­nese National Bureau of Sta­tis­tics cel­e­bra­tions of the 60th anniver­sary of the “found­ing of New China”. They are call­ing for sub­mis­sions of prose, poetry or song which will “enhance people’s patri­otic feel­ings, sta­tis­tics and con­fi­dence”. Here is an Eng­lish trans­la­tion of the page.

Some fur­ther trans­la­tions are on the WSJ page.

I espe­cially like this sec­tion from “Love the Home­land, Love Statistics”:

Because of sta­tis­tics
I can solve the deep­est mys­ter­ies
Because of sta­tis­tics
I will not be lonely again, play­ing in the data
Because of sta­tis­tics
I can rearrange the stars in the skies above
Because of sta­tis­tics
My life is dif­fer­ent, more mean­ing­ful
I love my life, my statistics

Ah. We’re an emo­tional lot really.
