A blog by Rob J Hyndman 

Workflow in R

Published on 18 September 2009

The question of how to organize the code for an R project came up recently on StackOverflow. One of the answers was particularly helpful and I thought it might be worth mentioning here. The idea presented there is to break the code into four files, all stored in your project directory. These four files are to be processed in the following order.

1. load.R: This file includes all code associated with loading the data. Usually, it will be a short file reading in data from files.
2. clean.R: This is where you do all the pre-processing of data, such as taking care of missing values, merging data frames and handling outliers. By the end of this file, the data should be in a clean state, ready to use. It is much better to do this here than to clean the data in the original file, as this gives you a complete record of everything done to the data.
3. functions.R: All of the functions needed to perform the actual analysis are stored here. This file should do nothing other than define the functions you need for analysis. (If you require your own functions for loading or cleaning the data, include them at the top of either load.R or clean.R.) In particular, functions.R should not do anything to the data. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.
4. do.R: Here is the code to actually do the analysis. This file will use the functions defined in functions.R to do the calculations, produce figures and tables, etc. All figures and tables that end up in your report, paper or thesis should be coded here. Never create figures and tables manually (i.e., with the mouse and menus), as then you can’t easily reproduce them.
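
To make the clean.R step concrete, here is a minimal sketch of the kind of code it might hold. The two toy data frames stand in for whatever load.R read in, and every name and the outlier threshold are invented for illustration only.

```r
# clean.R (sketch): every cleaning step is recorded in code,
# so the original data file is never edited by hand.
# Toy data standing in for the output of load.R; all names are hypothetical.
sales <- data.frame(id = c(1, 2, 3, 4),
                    amount = c(10.5, NA, 12.0, 9999),
                    stringsAsFactors = FALSE)
regions <- data.frame(id = c(1, 2, 3, 4),
                      region = c("north", "south", "north", "east"),
                      stringsAsFactors = FALSE)

# Missing values: drop rows with no recorded amount.
sales <- sales[!is.na(sales$amount), ]

# Outliers: treat amounts above an assumed threshold as data-entry errors.
max_plausible_amount <- 1000
sales <- sales[sales$amount <= max_plausible_amount, ]

# Merging: attach the region to each remaining sale.
clean_sales <- merge(sales, regions, by = "id")
```

Because the raw data are untouched, changing a rule (say, the threshold) only means editing this file and running it again.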

It is a good idea to save your workspace after each file is run.
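
One way to do that, sketched here with a hypothetical helper called run_stage(), is to source a file and then write the workspace to an .RData file named after it. The helper and the naming scheme are my assumptions, not part of the StackOverflow answer.

```r
# Hypothetical helper: run one script, then save the workspace it leaves behind.
# By default load.R is saved to load.RData, clean.R to clean.RData, and so on.
run_stage <- function(script, image = sub("\\.R$", ".RData", script)) {
  source(script)            # evaluated in the global environment by default
  save.image(file = image)  # saves the global workspace to the image file
  invisible(image)
}
```

After run_stage("load.R") and run_stage("clean.R"), a later session could start from load("clean.RData") rather than rerunning the slow steps.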

There are many advantages to this setup. First, you don’t have to reload the data each time you make a change in a subsequent step. Second, if you come back to an old project, you will be able to work out what was done relatively quickly. Third, it forces a certain amount of structured thinking about what you are doing, which is helpful.

Often there will be bits and pieces of code that you write, but don’t end up using, yet don’t want to delete. These should either be commented out or saved in files with other names. All analysis, from reading data to producing the final results, should be reproducible by simply source()ing these four files in order with no further user intervention.

I’ve tried this process on a few projects and found it rather too restrictive. In particular, my do.R file often becomes large and unwieldy. Instead, I am now using the following process.

  • main.R: This file simply contains a list of source statements to run each of the other R files in order.
  • functions.R: As above, all of the functions needed to perform the actual analysis are stored here. This file should do nothing other than define the functions you need for analysis.
  • Other files: All other code is contained in files of the form xxx.R, which are called in an appropriate order by main.R. The number and content of these files will depend on the project. Often they will include a load.R file and a clean.R file as above. However, I usually have more than one file containing the actual analysis (instead of the do.R file).
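
As an illustration of what "nothing other than define the functions" looks like, here is a small sketch of a functions.R file. Both functions are made up for the example. The point is only that the file contains definitions and never reads, modifies or plots data.

```r
# functions.R (sketch): definitions only. Nothing here touches the data.
# Both function names are hypothetical.

# Summarise a numeric vector, ignoring missing values.
summarise_values <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    n = sum(!is.na(x)))
}

# Fit a straight-line trend to a series observed at equally spaced times.
fit_trend <- function(y) {
  time <- seq_along(y)
  lm(y ~ time)
}
```

After editing one of these functions, running source("functions.R") again picks up the change without rerunning the loading or cleaning code.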

The important part of this is that running main.R will run the entire project from scratch. So if the data are updated, or the functions are changed, it is easy to repeat the entire analysis in one step: just run source("main.R").
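
In its simplest form, main.R is nothing more than a handful of source() calls, one per file. The sketch below is a slightly more defensive variant that loops over the file names and stops early if one is missing. The helper run_project() and the file names explore.R, models.R and figures.R are invented for illustration.

```r
# main.R (sketch): run the whole project from scratch.
# File names other than functions.R, load.R and clean.R are hypothetical.
scripts <- c("functions.R", "load.R", "clean.R",
             "explore.R", "models.R", "figures.R")

# Hypothetical helper: source each script in the order given.
run_project <- function(scripts, dir = ".") {
  paths <- file.path(dir, scripts)
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing script(s): ", paste(missing, collapse = ", "))
  }
  for (path in paths) {
    message("Running ", path)
    source(path)  # evaluated in the global environment by default
  }
  invisible(paths)
}
```

With this in place, the last line of main.R would be run_project(scripts), and the order of the vector documents the order of the analysis.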

It is important to be disciplined about keeping the R files neat and documented. You want to be able to figure out what each part of the code does when you look at it a year after writing it. That means inserting comments and removing anything that is not actually used.

Comments
  • patricio fuenmayor

    I am using R for projections, and it would be interesting to have what has already been developed.

  • http://bioinfoblog.it gioby

    and a good makefile to put it all together.

  • http://www.bertelsen.ca Brandon Bertelsen

    Thanks for the outline here. Always nice to see someone else’s process in order to improve your own.

  • François-Philippe Dubé

    Thank you for this tip. You might be interested in looking at ProjectTemplate (www.projecttemplate.net), available on CRAN. I personally find that the full project structure proposed is too much for my needs, hence I create a minimalist structure using one of the options in the project creation function.

  • Phillip Burger

    Having a good workflow is key to increasing productivity. I’m going to adjust how I organize my projects after reading your post.

    The other reader’s comment concerning http://www.projecttemplate.net is also helpful.

    Have you evolved how you organize your projects since your original post?