Almost exactly 20 years ago I wrote a paper with Yanan Fan on how sample quantiles are computed in statistical software. It was cited 43 times in the first 10 years, and 457 times in the next 10 years, making it my third paper to receive 500+ citations.
So what happened in 2006 to suddenly increase the citations? I think it was a combination of things:
I received this email from one of my undergraduate students:
I’m writing to you asking for advice on how to start a career in Data Science. Other professions seem a bit more straightforward, in that accountants, for example, simply look for internships and ways into companies from there. From my understanding, the nature of careers in data science seems to be on a project-to-project basis. I’m not sure how to get my foot in the door.
I am expecting to finish my degree by Semester 1 2016. In my job searching so far, I have only encountered positions which require 3+ years of previous data analysis experience and have not seen any “entry-level” data analysis positions or graduate data positions. What is the nature of entry-level recruitment in this industry?
Any help would be greatly appreciated.
RStudio has been a life-changer for the way I work, and for how I teach data analysis. I still have a couple of minor frustrations with it, but they are slowly disappearing as RStudio adds features.
I use dual monitors and I like to code on one monitor and have the console and plots on the other. Otherwise I see too little context, and long lines get wrapped, making the code harder to read. So I was very excited to see that RStudio has provided a great Christmas present this year: source code panes can now be split off into separate windows.
You need the preview version as the feature hasn’t yet found its way into the release version. The features are explained in this help file, in which I also discovered the amazing shortcut Ctrl + . to jump to a function definition. I’ve no idea how long that has been in RStudio, but I’ll be using it a lot.
Now if they would only introduce the ability to select columns for cut/copy/paste …
The GitHub page for the forecast package currently shows the following information
Note the downloads figure: 264K/month. I know the package is popular, but that seems crazy. Also, the downloads figure on GitHub only counts the downloads from the RStudio mirror, and ignores downloads from the other 125 mirrors around the world.
I’ve been having discussions with colleagues and university administration about the best way for universities to manage home-grown software.
The traditional business model for software is that we build software and sell it to everyone willing to pay. Very often, that leads to a software company spin-off that has little or nothing to do with the university that nurtured the development. Think MATLAB, S-Plus, Minitab, SAS and SPSS, all of which grew out of universities or research institutions. This model has repeatedly been shown to stifle research development, channel funds away from the institutions where the software was born, and add to research costs for everyone.
I argue that the open-source model is a much better approach both for research development and for university funding. Under the open-source model, we build software, and make it available for anyone to use and adapt under an appropriate licence. This approach has many benefits that are not always appreciated by university administrators.
It is a while since I last updated the CRAN version of the forecast package, so I uploaded the latest version (6.2) today. The GitHub version remains the most up-to-date version and is already two commits ahead of the CRAN version.
This update is mostly bug fixes and additional error traps. The full ChangeLog is listed below.
I’m back in California for the next couple of weeks, and will give the following talk at Stanford and UC Davis.
Optimal forecast reconciliation for big time series data
Time series can often be naturally disaggregated in a hierarchical or grouped structure. For example, a manufacturing company can disaggregate total demand for their products by country of sale, retail outlet, product type, package size, and so on. As a result, there can be millions of individual time series to forecast at the most disaggregated level, plus additional series to forecast at higher levels of aggregation.
A common constraint is that the disaggregated forecasts need to add up to the forecasts of the aggregated data. Adjusting the forecasts so that they satisfy this constraint is known as forecast reconciliation. I will show that the optimal reconciliation method involves fitting an ill-conditioned linear regression model where the design matrix has one column for each of the series at the most disaggregated level. For problems involving huge numbers of series, the model is impossible to estimate using standard regression algorithms. I will also discuss some fast algorithms for fitting this model that make it practical to use in business contexts.
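To make the regression idea concrete, here is a minimal sketch in Python of reconciliation on a toy hierarchy. It is not the algorithm from the talk: it assumes a hypothetical three-series hierarchy (Total = A + B), made-up base forecasts, and plain ordinary least squares rather than any weighted variant. The names `S`, `solve` and `reconcile` are illustrative only. The design matrix `S` has one row per series and one column per bottom-level series, and the reconciled forecasts are the fitted values from regressing the base forecasts on `S`.

```python
# Toy hierarchy (hypothetical): Total = A + B.
# Rows of S are the series (Total, A, B); columns are the bottom-level series (A, B).
S = [[1.0, 1.0],
     [1.0, 0.0],
     [0.0, 1.0]]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def reconcile(S, base):
    """OLS reconciliation: regress base forecasts on S and return the fitted values."""
    k = len(S[0])
    # Normal equations: (S'S) beta = S' y
    StS = [[sum(row[i] * row[j] for row in S) for j in range(k)] for i in range(k)]
    Sty = [sum(row[i] * y for row, y in zip(S, base)) for i in range(k)]
    beta = solve(StS, Sty)
    # Fitted values S beta are coherent by construction.
    return [sum(s * b for s, b in zip(row, beta)) for row in S]


# Made-up base forecasts that do not add up: 40 + 50 != 100.
base = [100.0, 40.0, 50.0]
reconciled = reconcile(S, base)
print(reconciled)
```

With these inputs the reconciled forecasts are roughly 96.67, 43.33 and 53.33, so the bottom-level values now sum to the total. Solving the normal equations directly like this is only sensible for tiny examples; with millions of columns in `S` it is exactly the computation that becomes infeasible.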
I’ve always struggled with using plotmath via the expression function in R for adding mathematical notation to axes or legends. For some reason, the most obvious way to write something never seems to work for me and I end up using trial and error in a loop with far too many iterations.
There are some tools that I use regularly, and I would like my research students and post-docs to learn them too. Here are some great online tutorials that might help.