Posts tagged R

Econometrics and R

Econo­me­tri­cians seem to be rather slow to adopt new meth­ods and new tech­nol­ogy (com­pared to other areas of sta­tis­tics), but slowly the use of R is spread­ing. I’m now receiv­ing requests for ref­er­ences show­ing how to use R in econo­met­rics, and so I thought it might be help­ful to post a few sug­ges­tions here.

A use­ful on-line and free resource is “Econo­met­rics in R” by Grant Farnsworth. It cov­ers some com­mon econo­met­ric meth­ods includ­ing het­eroskedas­tic­ity in regres­sion, pro­bit and logit mod­els, tobit regres­sion, and quan­tile regres­sion. In the time series area, it cov­ers ARIMA, ARFIMA, ARCH and GARCH mod­els, as well as a few of the stan­dard tests for unit roots and auto­cor­re­la­tion. It’s brief but it does pro­vide code that will help peo­ple famil­iar with econo­met­rics to get started using R.
If you are prepared to pay, an excellent book is Kleiber and Zeileis’s Applied Econometrics with R. It covers similar ground to Farnsworth but in more detail. This is the book I usually recommend to anyone with an econometrics background who wants to get started with R. It would also be very suitable for someone studying econometrics at about upper undergraduate level. Achim Zeileis is a well-known expert in R programming, so you can be sure the code in this book is efficient and well-written.
Another use­ful book is Pfaff’s Analy­sis of Inte­grated and Coin­te­grated Time Series with R which cov­ers unit root tests, coin­te­gra­tion, VECM mod­els, etc.
Vinod’s Hands-On Inter­me­di­ate Econo­met­rics Using R con­tains a lot of exam­ples and code-snippets which can be very help­ful. Unfor­tu­nately, the exam­ples do not always show the best prac­tice in R coding.
More detailed case stud­ies using R are pro­vided in Advances in Social Sci­ence Research Using R, edited by H.D. Vinod. Many of the case stud­ies are from econo­met­rics includ­ing an excel­lent chap­ter by Bruce McCul­lough on econo­met­ric computing.
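To give a flavour of the methods these references cover, here is a minimal sketch using only base R and its built-in data sets. The choice of the mtcars and lh data, and the particular model specifications, are assumptions made for illustration; they are not taken from any of the books above.

```r
# Probit and logit models for a binary outcome
# (am is the transmission type in the built-in mtcars data).
probit_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "probit"))
logit_fit  <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
print(coef(summary(logit_fit)))

# An AR(1) model, i.e. ARIMA(1,0,0), for the built-in lh time series.
ar_fit <- arima(lh, order = c(1, 0, 0))
print(ar_fit)

# Ljung-Box test for remaining autocorrelation in the residuals;
# fitdf = 1 accounts for the one estimated AR coefficient.
lb <- Box.test(residuals(ar_fit), lag = 10, type = "Ljung-Box", fitdf = 1)
print(lb)
```

Tobit, quantile regression and GARCH models need add-on packages, which is where the references above become useful.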

There are of course dozens of books on R with a more sta­tis­ti­cal per­spec­tive, includ­ing sev­eral on time series. But I will leave them for another post.


Twenty rules for good graphics

One of the things I repeatedly include in referee reports, and in my responses to authors who have submitted papers to the International Journal of Forecasting, is a set of comments designed to improve the quality of the graphics. Recently someone asked on stats.stackexchange.com about best practices for producing plots. So I thought it might be helpful to collate some of the answers given there and add a few comments of my own taken from things I’ve written for authors.

The fol­low­ing “rules” are in no par­tic­u­lar order.

  1. Use vec­tor graph­ics such as eps or pdf. These scale prop­erly and do not look fuzzy when enlarged. Do not use jpeg, bmp or png files as these will look fuzzy when enlarged, or if saved at very high res­o­lu­tions will be enor­mous files. Jpegs in par­tic­u­lar are designed for pho­tographs not sta­tis­ti­cal graphics.
  2. Use read­able fonts. For graph­ics I pre­fer sans-serif fonts such as Hel­vetica or Arial. Make sure the font size is read­able after the fig­ure is scaled to what­ever size it will be printed.
  3. Avoid clut­tered leg­ends. Where pos­si­ble, add labels directly to the ele­ments of the plot rather than use a leg­end at all. If this won’t work, then keep the leg­end from obscur­ing the plot­ted data, and make it small and neat.
  4. If you must use a leg­end, move it inside the plot, in a blank area.
  5. No dark shaded backgrounds. Excel always adds a nasty dark gray background by default, and I’m always asking authors to remove it. Graphics print much better with a white background. The ggplot2 package for R also uses a gray background (although it is lighter than the Excel default). I don’t mind the ggplot2 version so much as it is used effectively with white grid lines. Nevertheless, even the light gray background doesn’t lend itself to printing/photocopying. White is better.
  6. Avoid dark, dom­i­nat­ing grid lines (such as those pro­duced in Excel by default). Grid lines can be use­ful, but they should be in the back­ground (light gray on white or white on light gray).
  7. Keep the axis lim­its sen­si­ble. You don’t have to include a zero (even if Excel wants you to). The defaults in R work well. The basic idea is to avoid lots of white space around the plot­ted data.
  8. Make sure the axes are scaled prop­erly. Another Excel prob­lem is that the hor­i­zon­tal axis is some­times treated cat­e­gor­i­cally instead of numer­i­cally. If you are plot­ting a con­tin­u­ous numer­i­cal vari­able, then the hor­i­zon­tal axis should be prop­erly scaled for the numer­i­cal variable.
  9. Do not for­get to spec­ify units.
  10. Tick inter­vals should be at nice round numbers.
  11. Axes should be prop­erly labelled.
  12. Use line widths thick enough to see. 1pt lines tend to disappear if plots are shrunk.
  13. Avoid over­lap­ping text on plot­ting char­ac­ters or lines.
  14. Fol­low Tufte’s prin­ci­ples by remov­ing chart junk and keep­ing a high data-ink ratio.
  15. Plots should be self-explanatory, so include detailed captions.
  16. Use a sen­si­ble aspect ratio. I think width:height of about 1.6 works well for most plots.
  17. Pre­pare graph­ics in the final aspect ratio to be used in the pub­li­ca­tion. Dis­torted fonts look awful.
  18. Use points not lines if ele­ment order is not relevant.
  19. When prepar­ing plots that are meant to be com­pared, use the same scale for all of them. Even bet­ter, com­bine plots into a sin­gle graph if they are related.
  20. Avoid pie-charts. Espe­cially 3d pie-charts. Espe­cially 3d pie-charts with explod­ing wedges. I promise all my stu­dents an instant fail if I ever see any­thing so appalling.
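Several of these rules can be followed with a few lines of base R graphics. The following is only a sketch: the data are simulated, and the file name, axis labels and series names are all made up for illustration.

```r
# Simulated monthly series (made-up data, for illustration only).
set.seed(1)
months  <- 1:48
sales_a <- 100 + 0.8 * months + rnorm(48, sd = 3)
sales_b <-  90 + 0.5 * months + rnorm(48, sd = 3)

out_file <- file.path(tempdir(), "sales.pdf")  # hypothetical file name

# Rules 1, 16 and 17: vector output, created at its final size,
# with width:height = 6.4:4 = 1.6.
pdf(out_file, width = 6.4, height = 4)
par(mar = c(4, 4, 1, 1), las = 1)

# Rules 9 and 11: labelled axes with units. The x-axis is extended
# a little to leave room for the direct labels added below.
plot(months, sales_a, type = "n",
     xlim = c(1, 58), ylim = range(sales_a, sales_b),
     xlab = "Month", ylab = "Sales (thousands of units)")

# Rule 6: light gray grid lines, drawn before the data so they stay behind it.
grid(col = "gray90", lty = 1)

# Rule 12: lines thicker than the default.
lines(months, sales_a, lwd = 2)
lines(months, sales_b, lwd = 2, lty = 2)

# Rule 3: label the lines directly rather than using a legend.
text(48, sales_a[48], "Product A", pos = 4)
text(48, sales_b[48], "Product B", pos = 4)

dev.off()
```

Both series share one set of axes, which also takes care of rule 19.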

The clas­sic books on graph­ics are:

These are both highly rec­om­mended. (If you can’t see the books above, turn off your ad-blocker.)


Statistical Analysis StackExchange site now available

The Q&A site for sta­tis­ti­cal analy­sis, data min­ing, data visu­al­iza­tion, and every­thing else to do with data analy­sis has finally been launched. Please head over to

stats.StackExchange.com

and start ask­ing and answer­ing questions.

Also, spread the word to every­one else who may be inter­ested — work col­leagues, stu­dents, etc. The more peo­ple who use the site, the bet­ter it will be. There are already 170 ques­tions, 513 answers and 387 users.

Even­tu­ally the site will move to a dif­fer­ent domain name and have its own logo, etc.  For now it is in “pub­lic beta” which means that it is fully func­tional, but we are still work­ing out some of the details (such as what it will be called, who will be the mod­er­a­tors, etc.).

R ques­tions are allowed on this new site as well as on the orig­i­nal StackOverflow.com. We are still fig­ur­ing out how to avoid the prob­lem of hav­ing answers on two sites. For now, more sta­tis­ti­cal ques­tions should be directed to stats.StackExchange.com and more programming-oriented ques­tions should go to StackOverflow.com.


More StackExchange sites

The Stack­Ex­change site on Sta­tis­ti­cal Analy­sis is about to go into pri­vate beta test­ing. This is your last chance to com­mit if you want to be part of the pri­vate beta test­ing. Don’t worry if you miss out — it will only be a week before it is then open to the public.

There is also a Stack­Ex­change site pro­posal for TeX, LaTeX and friends. Pre­sum­ably that means that most of the LaTeX ques­tions on Stack­Over­flow will then move to this new site. It still needs a cou­ple of hun­dred more peo­ple to com­mit before it can be launched, so if you are inter­ested in LaTeX, please com­mit to being part of it.

Another site pro­posal that may be of inter­est to read­ers of this blog is the one on Eng­lish lan­guage usage.

A few proposals are already open to the public for beta testing. One that I’ve been using a little is Web Apps, which is useful for questions on Gmail, Google Reader, WordPress, etc.


Stack exchange for statistical analysis needs you!

The pro­posal to cre­ate a Stack­Ex­change site for sta­tis­ti­cal analy­sis is steadily mov­ing for­ward. We have now com­pleted the scop­ing stage which involved find­ing enough peo­ple will­ing to express an inter­est in the idea, and vot­ing on some exam­ple ques­tions to define what is allowed and what is not allowed on the site. The on-topic ques­tions that have been selected are these:

  1. What is a ‘stan­dard deviation’?
  2. Which of the fol­low­ing three graph­ics best dis­plays this data set? Why?
  3. What’s the best way to iden­tify an out­lier in mul­ti­vari­ate data?
  4. Can you give an exam­ple of where I might pre­fer to use a z-test vs a t-test?
  5. What are the dif­fer­ences between Bayesian and Fre­quen­tist reasoning?

Exam­ples of ques­tions con­sid­ered off-topic are:

  1. How do I win in Poker?
  2. I have two chil­dren. One is a boy born on a Tues­day. What is the prob­a­bil­ity I have two boys?
  3. Joe is 8 years old, Mike is 10 years old, and Alice is 13. What is their MEDIAN age?
  4. Where can I access NASA’s data archives?
  5. How much should I expect to pay for a SAS licence?

The next phase is to get peo­ple to com­mit to con­tribut­ing to the site. Many read­ers of this blog have already reg­is­tered as “fol­low­ers” — now you have to make a com­mit­ment to be a con­trib­u­tor as well. The site won’t launch until there are enough peo­ple com­mit­ted to being part of it.

Just go to the site and indi­cate that you are will­ing to be an active par­tic­i­pant once it launches.

If you’re won­der­ing what this is all about, and why this is a much bet­ter approach than the var­i­ous usenet and email help groups, there’s a nice sum­mary on Tal Galili’s blog.


Learning R by video

For those peo­ple who pre­fer to be shown how to do some­thing rather than read the instruc­tions, there are some videos on using R avail­able online. Here are the ones I know about. Please add links to other sim­i­lar resources in the comments.


Using Google Reader

Google Reader is a fan­tas­tic way to keep track of new papers that are appear­ing in many dif­fer­ent jour­nals, and also to fol­low some of the inter­est­ing research blogs (and blogs on other top­ics) that are out there. Google Reader checks web­sites for you and lets you know of any new mate­r­ial that appears. Instead of you hav­ing to look at dozens of dif­fer­ent web­sites to dis­cover new infor­ma­tion, all you need to do is open up Google Reader and all the infor­ma­tion comes to you. In some ways it is like an email account, but where the mes­sages con­tain new addi­tions to web­sites that you are inter­ested in.

Google Reader is called an “RSS reader” because it reads RSS feeds. RSS stands for “Really Sim­ple Syn­di­ca­tion”. A web­site with an RSS feed makes it pos­si­ble to track addi­tions to the site with­out actu­ally vis­it­ing it your­self.  There are other RSS read­ers, but Google Reader is the most widely used. Recently Google Reader added a facil­ity so that it now also tracks sites that don’t have RSS feeds.

If you haven’t used it before, here’s how to get started.

  1. Go to www.google.com/reader and log in. If you already have a Google account (e.g., you’re a Gmail user), then just use your usual Google details. If you don’t have a Google account, then you will need to set one up.
  2. Click “Add subscription” and type the URL of any website you want to track.
  3. When you are reading a website that you would like to subscribe to, click its orange RSS button. A modern browser such as Firefox or Chrome will figure out that you want to subscribe to the RSS feed. If that doesn’t work, just copy the link address and paste it into the “Add subscription” box in Google Reader.

Each morn­ing I read through any­thing new on Google Reader includ­ing new research papers in jour­nals that I track, new arti­cles on some sta­tis­tics blogs that I fol­low, etc. In fact, I have over 500 sub­scrip­tions! I don’t read every arti­cle or it would take all day, but I do scan the head­lines and read what looks interesting.

It can take a while to col­lect all the sub­scrip­tions for jour­nals you might want to read. To make it easy, you can just piggy-back on my jour­nal col­lec­tion (which cov­ers all sta­tis­tics jour­nals, both fore­cast­ing jour­nals, plus a few econo­met­rics and demog­ra­phy jour­nals, as well as all sta­tis­ti­cal preprints on arxiv). Click here if you want to sub­scribe to all the same jour­nals as me.

If you are inter­ested in R, R-bloggers is very use­ful as it com­bines the posts from a large num­ber of blogs about R.  Just go to the site and click on the RSS feed icon and you will be able to add a sub­scrip­tion to your Google Reader account.

For those who like to keep up with LaTeX, the TeX com­mu­nity aggre­ga­tor does some­thing sim­i­lar for blog­gers writ­ing about LaTeX and related top­ics. Again, just click on the RSS feed icon.

Here is a list of sta­tis­tics research blogs. Check them out and sub­scribe to any­thing that takes your fancy.

This web­site has an RSS feed, as do my other web­sites. Just click the orange but­ton at the top-right of the page and select “Google Reader” and then you will receive any new posts I make in your Google Reader account.


Workflow in R

This came up recently on Stack­Over­flow. One of the answers was par­tic­u­larly help­ful and I thought it might be worth men­tion­ing here. The idea pre­sented there is  to break the code into four files, all stored in your project direc­tory. These four files are to be processed in the fol­low­ing order.

load.R
This file includes all code asso­ci­ated with load­ing the data. Usu­ally, it will be a short file read­ing in data from files.
clean.R
This is where you do all the pre-processing of data, such as taking care of missing values, merging data frames and handling outliers. By the end of this file, the data should be in a clean state, ready to use. It is much better to do this here than to clean the data in the original file, as this gives you a complete record of everything done to the data.
functions.R
All of the func­tions needed to per­form the actual analy­sis are stored here.  This file should do noth­ing other than define the func­tions you need for analy­sis. (If you require your own func­tions for load­ing or clean­ing the data, include them at the top of either load.R or clean.R.) In par­tic­u­lar, functions.R should not do any­thing to the data. This means that you can mod­ify this file and reload it with­out hav­ing to go back and repeat steps 1 & 2 which can take a long time to run for large data sets.
do.R
Here is the code to actually do the analysis. This file will use the functions defined in functions.R to do the calculations, produce figures and tables, etc. All figures and tables that end up in your report, paper or thesis should be coded here. Never create figures and tables manually (i.e., with the mouse and menus) as then you can’t easily reproduce them.

It is a good idea to save your work­space after each file is run.

There are many advantages to this setup. First, you don’t have to reload the data each time you make a change in a subsequent step. Second, if you come back to an old project, you will be able to work out what was done relatively quickly. It also forces a certain amount of structured thinking about what you are doing, which is helpful.

Often there will be bits and pieces of code that you write, but don’t end up using, yet don’t want to delete. These should either be com­mented out or saved in files with other names. All analy­sis from read­ing data to pro­duc­ing the final results should be repro­ducible by sim­ply source()ing these four files in order with no fur­ther user intervention.
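A minimal sketch of the four-file process might look like the following. The contents of the four files here are placeholders invented for illustration, and they are written to a temporary folder only so that the example runs anywhere; in a real project the files would already exist in your project directory.

```r
# Hypothetical project directory (a temporary folder for this sketch).
proj <- file.path(tempdir(), "project")
dir.create(proj, showWarnings = FALSE)

# Placeholder contents for the four files (assumptions, not real analysis code).
writeLines("raw <- data.frame(x = c(1, 2, NA, 4), y = c(2.1, 3.9, 6.2, 8.1))",
           file.path(proj, "load.R"))
writeLines("dat <- raw[complete.cases(raw), ]",
           file.path(proj, "clean.R"))
writeLines("fit_line <- function(d) lm(y ~ x, data = d)",
           file.path(proj, "functions.R"))
writeLines("fit <- fit_line(dat)",
           file.path(proj, "do.R"))

# Run the files in order, saving the workspace after each one
# (load.RData, clean.RData, and so on).
for (f in c("load.R", "clean.R", "functions.R", "do.R")) {
  source(file.path(proj, f))
  save.image(file.path(proj, sub("\\.R$", ".RData", f)))
}
```

Because the workspace is saved after each step, you can load("clean.RData") later and pick up from there without re-reading and re-cleaning the data.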

I’ve tried this process on a few projects and found it rather too restric­tive. In par­tic­u­lar, my do.R file often becomes large and unwieldy. Instead, I am now using the fol­low­ing process.

main.R
This file sim­ply con­tains a list of source state­ments to run each of the other R files in order.
functions.R
As above, all of the func­tions needed to per­form the actual analy­sis are stored here.  This file should do noth­ing other than define the func­tions you need for analysis.
xxx.R
All other code is con­tained in files of the form xxx.R which are called in an appro­pri­ate order by main.R. The num­ber and con­tent of these files will depend on the project. Often it will include a load.R file and clean.R file as above. How­ever, I usu­ally have more than one file con­tain­ing the actual analy­sis (instead of the do.R file).

The impor­tant part of this is that run­ning main.R will run the entire project from scratch. So if the data are updated, or the func­tions are changed, it is easy to repeat the entire analy­sis in one step — just run source("main.R").
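Here is a rough sketch of what that might look like. Apart from main.R and functions.R, the file names and their contents are hypothetical, and everything is written to a temporary folder so that the example is self-contained.

```r
# Hypothetical project directory (a temporary folder for this sketch).
proj <- file.path(tempdir(), "project_main")
dir.create(proj, showWarnings = FALSE)

# Placeholder files (invented for illustration).
writeLines("square <- function(x) x^2",  file.path(proj, "functions.R"))
writeLines("dat <- data.frame(x = 1:5)", file.path(proj, "load.R"))
writeLines("dat$x2 <- square(dat$x)",    file.path(proj, "analysis1.R"))
writeLines("total <- sum(dat$x2)",       file.path(proj, "analysis2.R"))

# main.R is nothing more than a list of source() statements.
writeLines(c('source("functions.R")',
             'source("load.R")',
             'source("analysis1.R")',
             'source("analysis2.R")'),
           file.path(proj, "main.R"))

# chdir = TRUE temporarily switches to the folder containing main.R,
# so the relative file names inside it are found.
source(file.path(proj, "main.R"), chdir = TRUE)
print(total)
```

If your working directory is already the project directory, plain source("main.R") does the same thing.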

It is impor­tant to be dis­ci­plined about keep­ing the R files neat and doc­u­mented. You want to be able to fig­ure out what each part of the code does when you look at it a year after writ­ing it. That means insert­ing com­ments and remov­ing any­thing that is not actu­ally used.


Finding an R function

Sup­pose you want a func­tion to fit a neural net­work. What’s the best way to find it? Here are three steps that help to find the elu­sive func­tion rel­a­tively quickly.

First, use help.search("neural") or the short­hand ??neural. This will search the help files of installed pack­ages for the word “neural”. Actu­ally, fuzzy match­ing is used so it returns pages that have words sim­i­lar to “neural” such as “nat­ural”. For a stricter search, use help.search("neural",agrep=FALSE). The fol­low­ing results were returned for me (using the stricter search).

nnet::nnetHess           Evaluates Hessian for a Neural Network
nnet::nnet               Fit Neural Networks
nnet::predict.nnet       Predict New Examples by a Trained Neural Net
tseries::terasvirta.test
                         Teraesvirta Neural Network Test for Nonlinearity
tseries::white.test      White Neural Network Test for Nonlinearity
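The search results can also be stored and inspected as a data frame, which is handy when there are many matches. This is a small sketch; the results depend on which packages you have installed, so yours will differ from mine.

```r
# Strict (non-fuzzy) search of the help pages of installed packages.
res <- help.search("neural", agrep = FALSE)

# The matches component holds one row per hit; keep the most useful columns.
hits <- unique(res$matches[, c("Package", "Topic", "Title")])
print(hits[order(hits$Package), ])
```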

If you want to look through pack­ages that you have not nec­es­sar­ily installed, you could try using the findFn func­tion in the sos pack­age. This func­tion searches the help pages of pack­ages cov­ered by the RSite­Search archives (which includes all pack­ages on CRAN). For example

library(sos)  # install first with install.packages("sos") if needed
findFn("neural")

returns 206 matches (on 13 Sep­tem­ber 2009) from over 60 pack­ages. These are ordered based on a rel­e­vance score, so the top few pack­ages on the list are prob­a­bly the most use­ful. (There’s a good dis­cus­sion of findFn here.)

To look even fur­ther afield, use RSiteSearch("neural") which will send your search to R site search. As well as CRAN pack­ages, this cov­ers the R-help mail­ing list archives, help pages, vignettes and task views.

Still no luck? Try Rseek.org. It’s a restricted Google search covering only R-related sites.

Finally, if all else fails, try asking a question on StackOverflow (make sure your question is tagged with “r”), or send your question to the R-help mailing list. Both usually elicit replies in under an hour. However, don’t do this without first trying to find the function using one of the other methods.


R help on StackOverflow

Ever since I began using R about ten years ago, the best place to find R help was on the R-help mail­ing list. But it is time-consuming search­ing through the archives try­ing to find some­thing from a long time ago, and there is no way to sort out the good advice from the bad advice.

But now there is a new tool and it is very neat. Head over to the R tag on Stack­Over­flow. Stack­Over­flow is a web­site for pro­gram­ming ques­tions. It’s much bet­ter than a mail­ing list because it allows easy search­ing through answers, and vot­ing on answers so that the best answers appear at the top.

If you are a reg­is­tered user (reg­is­tra­tion is free), you can vote on the answers that you find. The more votes your answers receive, the more priv­i­leges you have on the site. This is the web at its best! Here are my answers to date.

It would be great if more R users migrated from the old R-help list to StackOverflow.
