I have been traveling during the last two weeks. I visited Fred Hutchinson Cancer Research Center on Oct 16 and the Department of Biostatistics at Johns Hopkins at the invitation of Simply Statistics on Oct 23. Today Christian Robert was visiting our department at Iowa State, and I also talked to him. It is really cool to talk to those guys, and here are some quick notes.
FHCRC and Bioconductor
I went to Seattle for a tutorial at Visweek. Since I did not have a talk there, I decided to pay a visit to Fred Hutchinson and Bioconductor. This place has a nice view of the Lake Union:
These pictures were taken from the Space Needle later instead of Hutch…
I met Raphael Gottardo (and Mike Jiang in his lab), Dan Tenenbaum and Zach Stednick; Bioc people were busy at that time with a course, so I did not manage to meet more. My biology has gone back to the kindergarten level after I went to college; it seems only four letters remained in my brain: ATCG (I kind of remember chromosome). So it was hopeless to talk about biology.
The real reason I visited them was all of them more or less tried my knitr package, which apparently excited me. Through the visit, I saw Sweave was still a challenge for them to switch to knitr, because
R CMD build heavily relies on Sweave to build package vignettes. That is why I posted this wishlist to R core after I came back, hoping alternative packages could be allowed for building package vignettes. FWIW, Sweave was a great invention, but we can just do better. Perhaps Bioconductor can be a good place to experiment with better R documentation like this and this (Markdown/HTML vignettes).
Coincidently, I learned Zach was one of the organizers of Seattle useR Group, and also the student of Roger Peng, whom I would be visiting in the following week.
When I was just about to leave, John Ramey saw me on his way to get a coffee. My goodness, I totally forgot he was at Hutch as well. John was an early user of knitr, and I caught him at JSM this year with Ryan Rosario. Since I had to leave to meet Joe (RStudio), I invited him for dinner in the evening, where we had a good chat together with Tengfei Yin.
RStudio has been developing a cool package for building interactive (or precisely, reactive) web interfaces with R, but I probably should not write too much about it since it is still in early beta stage at this point. JJ and Joe are really awesome developers. It is amazing to see how quickly they adapted to reference classes in R.
Besides, they are also excellent listeners and make use of every chance to get the feedback from users. I did not know Joe was in Seattle, and he just quickly came to me.
I was thrilled when I received an invitation from Simply Statistics for a visit to the Biostatistics Department at Johns Hopkins. I started following their blog since the Duke Saga, and I was shocked by the story: non-reproducible research kills people, literally!
This was probably the best visit I had ever had. I like everybody there. I agree with so many thoughts of the three authors behind Simply Statistics, Roger, Jeff (aka Leekasso) and Rafa, such as the skepticism on measure theory, the fast journal, the deterministic machine, the point of point estimate, etc. Sometimes I want to yell at them because they are ruining my PhD – I need time to finish my thesis instead of thinking of all kinds of “weird” ideas. Well, I’m kidding :)
1. Academia or industry?
I have been asked for quite a few times about my plan for future career. Initially I thought I was doomed for academia, since I really do not care much about measure theory. It is a PhD core course at Iowa State like most other departments in the US, and we have super professors on this course. I did a stupid thing when I was learning this course. I asked the professor this:
Have you ever seen a set which is not measurable in mathematical statistics?
Apparently I was bored from the very beginning when we were trying to construct a non-measurable set. I did not understand why every statistics PhD has to start with measurable sets. I mean that seems to be too far away from statistics, at which stage we take many things “for granted”. We assume measurable functions and smoothness up to a certain order of derivatives; R-N theorem is beautiful, but we rarely go beyond a limited number of density functions in practice, neither do we go beyond Lebesgue and counting measures. We use the Dirichlet function to show Riemann integration does not work but it has a Lebesgue integral, but I see no point in such a weird function; for the most of my time, I still use the rules for Riemann integration.
So narrow-minded as I am, will I be able to survive the academia? Before my visit, I thought the answer would be definitely no. Kasper Hansen kind of convinced me this might not be correct. Doing theoretical work is one way to go, but it does not mean doing computing is hopeless. That said, it is not enough to do general computing; one may have to find a very specific direction, e.g. work on very specific but also very challenging type of data.
2. The talk
I gave an informal talk titled “Can you reproduce your homework?” primarily to students and post-docs there. I got the hilarious website Research in Progress from Simply Statistics just a few days ago, and I immediately used some funny GIF images in my HTML5 slides. They fit pretty well with my talk.
3. Behind the tan door
This is my only regret of the visit. Damn it! I should have taken pictures with the great comedians in the department. I did not realize the video “Behind the tan door: getting ahead in academia” was made at JHSPH until I came back. I watched it in January last year, and I loved the humor so much that I watched it again for several times after that, so I’m very familiar with the faces of the actors. That is why I immediately recognized Tom Louis when he passed my door, and I told Roger I knew that face. He was surprised that I knew that video.
Later that day I saw Scott Zeger twice. His face was even more familiar to me; I can never forget his hilarious logistic regression model on promotion of young faculties in that video.
Last year I emailed Steve Goodman (who was behind the camera in the video) and asked him what “tan door” means. I learned tan was the color of the doors at JHSPH, and there was a well-known movie in the early years with a similar name (just like he did not dare tell me, I do not dare tell you here). At that time I noticed Steve was at JHSPH, but obviously I forgot it later. Anyway, here is the video that I highly recommend (with permission to re-publish):
4. Point estimate has no point
I’m exaggerating a little bit. Someone told me that PowerPoint has neither power nor point, and similarly I think point estimate alone has no point. This is not part of the visit; Simply Statistics often has a good point. The latest blog post on Nate Silver was such a good one. The problem is the general public only care about the point estimate in most time, and ignore the uncertainty associated with the estimate. In theory, we should take a look at both.
5. Reproducible research
In fact I’m not very confident with talking about this big topic; as I said in my slides, it is deep. During the visit, they brought up a few challenging problems in practice, and what I have been doing is smallish things such as homework and websites. I’m not sure how things scale to Gigabytes of complex data, and I really need to know more about problems that data practitioners are facing. I’ll announce a website that implements my fast journal idea in the next few days.
I just got a chance to attend the workshop Reproducibility in Computational and Experimental Mathematics to be held at ICERM (Brown University), which seems to be a great opportunity to learn more about broader issues about reproducible research.
The blog of Christian Robert was named as Xi’an’s OG, which was the reason why it caught my attention long time ago. Xi’an happens to be a big city in China, but it has nothing to do with Christian. Obviously it has confused enough Chinese people. He told me he was also confused by the road sign PED XING like me. It took me like two years to figure out what XING means, and Americans may never understand why Chinese are confused by this: XING happens to be the Chinese Pinyin of the character “Walk” (行). We are confused just because we have no idea why Americans put a road sign in Chinese!!
I showed him my work on the R packages, and he kindly promoted them in his blog later. It also surprised me that he does not use BUGS as a Bayesian, and he is the first person I have ever met who used TeX instead of LaTeX in the early years. I know there must be plenty of people; they just did not tell me.
He wanted to be able to run R code inside the slides, and one of my ideas was to use John Verzani’s gWidgetsWWW2 package. It will certainly work, as I saw John demonstrated at useR! 2012 (he made his slides with this package). I’m wondering if someone could wrap up a package specially for creating slides with gWidgetsWWW2.
I asked him of his opinion on a PhD thesis based on R packages instead of theorems. I was happy to hear him say “Well, you know, there are theorems in theses which are never used…”. I hope my thesis committee will agree with him.
Ah, a long blog post… If there is a take home message, it will be something I have said for many times: you need a website; it really helps.