I received this email from one of my undergraduate students:
I’m writing to you asking for advice on how to start a career in Data Science. Other professions seem a bit more straight forward, in that accountants for example simply look for Internships and ways into companies from there. From my understanding, the nature of careers in data science seem to be on a project-to-project basis. I’m not sure how to get my foot stuck in the door.
I am expecting to finish degree by Semester 1 2016. In my job searching so far, I have only encountered positions which require 3+ years of previous data analysis experience and have not seen any “entry-level” data analysis positions or graduate data positions. What is the nature of entry level recruitment in this industry?
Any help would be greatly appreciated.
Continue reading →
Di Cook and I are organizing a workshop on “Making data analysis easier” for 18-19 February 2016.
We are calling it WOMBAT2016, which an acronym for Workshop Organized by the Monash Business Analytics Team. Appropriately, it will be held at the Melbourne Zoo. Our plan is to make these workshops an annual event.
Some details are available on the workshop website. Key features are:
- Hadley Wickham is our keynote speaker. He has been instrumental in changing the way we think about data analysis, and providing new tools for tidying, rearranging, summarising and plotting data. His R packages (including tidyr, dplyr, ggplot2, and ggvis) are very widely used.
- Other speakers include Phil Brierley, Eugene Dubossarsky, Heike Hofmann, Thomas Lumley, Andrew Robinson, Elle Saber, Carson Sievert, Zoe van Havre, Geoff Webb, Yanchang Zhao, as well as Di and me.
- The numbers are limited to a total of 100 with a quota on students, academics and people from business/industry. The aim is to have a good mix of people from different backgrounds to encourage productive discussions and mutual learning.
- Register on Eventbrite.
- We also have some places available for contributing speakers (15 minute talks). If you would like to do a contributed talk, you will need to email us a title and abstract by 15 January. We will notify you if your peer-reviewed abstract is successful by 29 January.
If you miss out on the workshop, you can still hear Hadley speak. Data Science Melbourne will host a meetup featuring him in the evening of Monday 22 February 2016.
RStudio has been a life-changer for the way I work, and for how I teach data analysis. I still have a couple of minor frustrations with it, but they are slowly disappearing as RStudio adds features.
I use dual monitors and I like to code on one monitor and have the console and plots on the other monitor. Otherwise I see too little context, and long lines get wrapped making the code harder to read. So I was very excited to see that RStudio has provided a great Christmas present this year, with source code panes able to be split off into separate windows.
You need the preview version as the feature hasn’t yet found its way into the release version. The features are explained in this help file, in which I also discovered the amazing shortcut
Ctrl + . to jump to a function definition. I’ve no idea how long that has been in RStudio, but I’ll be using it a lot.
Now if they would only introduce the ability to select columns for cut/copy/paste …
The github page for the forecast package currently shows the following information
Note the downloads figure: 264K/month. I know the package is popular, but that seems crazy. Also, the downloads figure on github only counts the downloads from the RStudio mirror, and ignores downloads from the other 125 mirrors around the world. Continue reading →
I’ve been having discussions with colleagues and university administration about the best way for universities to manage home-grown software.
The traditional business model for software is that we build software and sell it to everyone willing to pay. Very often, that leads to a software company spin-off that has little or nothing to do with the university that nurtured the development. Think MATLAB, S-Plus, Minitab, SAS and SPSS, all of which grew out of universities or research institutions. This model has repeatedly been shown to stifle research development, channel funds away from the institutions where the software was born, and add to research costs for everyone.
I argue that the open-source model is a much better approach both for research development and for university funding. Under the open-source model, we build software, and make it available for anyone to use and adapt under an appropriate licence. This approach has many benefits that are not always appreciated by university administrators. Continue reading →
I prepared the following notes for a consulting client, and I thought they might be of interest to some other people too.
Let $y_t$ denote the value of the time series at time $t$, and suppose we wish to fit a trend with correlated errors of the form
$$ y_t = f(t) + n_t, $$
where $f(t)$ represents the possibly nonlinear trend and $n_t$ is an autocorrelated error process. Continue reading →
It is a while since I last updated the CRAN version of the forecast package, so I uploaded the latest version (6.2) today. The github version remains the most up-to-date version and is already two commits ahead of the CRAN version.
This update is mostly bug fixes and additional error traps. The full ChangeLog is listed below. Continue reading →
I gave a seminar at Stanford today. Slides are below. It was definitely the most intimidating audience I’ve faced, with Jerome Friedman, Trevor Hastie, Brad Efron, Persi Diaconis, Susan Holmes, David Donoho and John Chambers all present (and probably other famous names I’ve missed).
I’ll be giving essentially the same talk at UC Davis on Thursday. Continue reading →
Jane Frazier spoke at our research team meeting today on “Reproducibility in computational research”. We had a very stimulating and lively discussion about the issues involved. One interesting idea was that reproducibility is on a scale, and we can all aim to move further along the scale towards making our own research more reproducible. For example
- Can you reproduce your results tomorrow on the same computer with the same software installed?
- Could someone else on a different computer reproduce your results with the same software installed?
- Could you reproduce your results in 3 years time after some of your software environment may have changed?
Think about what changes you need to make to move one step further along the reproducibility continuum, and do it.
Jane’s slides and handout are below. Continue reading →
I will be speaking at the Chinese R conference in Nanchang, to be held on 24-25 October, on “Forecasting Big Time Series Data using R”.
Details (for those who can read Chinese) are at china-r.org.