Preparation for a transition to data-science


My PhD advisor, Professor Michael Blanton, once gave me advice on how to choose which Postdoc offer to take: “Go where the data are …”. There now appears to be an interesting shift in the world of data. Academia used to be where most people went to get interesting data for analysis, but now, with the massive amounts of data collected online and from mobile devices, it appears that industry is where a lot of the action is heading (e.g., health care, municipalities, Facebook and so on). This trend is resulting in a massive migration of brains from academia. Or, rephrasing Bob Marley: Exodus! Movement of Da[ta] people.

Often I am asked by non-scientists: “But what does data-science have to do with your studies in astronomy and physics?” Well, as a PhD student and later a Postdoc doing analysis in cosmology, I jokingly described myself not as a true physicist who contemplates the equations of physical processes in depth, but rather as a number cruncher, or a data analyst. Now that I have decided to transfer to the glorious (?) world of data-science, I have been researching the topic in order to prepare myself for the transition. Here I describe the initial steps and resources that I feel are helping me to prepare for non-academic interviews. This is by no means an exhaustive or generic list. I highly recommend reading Jessica Kirkpatrick’s excellent articles on the topic of transitioning to data-science. I also recommend reading the White Paper of Insight Data Science.

(1) Learning from others – There is a big migration of academics to data science for various professional and personal reasons (job security, flexibility in choice of location, good salaries, high demand and new challenges, to name a few). Feel free to talk to colleagues in your department and worldwide about their preparations, experiences and opinions. Also, if you have some experience with the topic, it would be great to let your academic colleagues know about your transition, because they are probably wondering what it involves, and might be embarrassed to ask because of the negative connotations the matter carries in some academic environments.

(2) Drew Conway uses a Venn Diagram in this nice overview to describe all the aspects that a data-scientist should be knowledgeable about. One need not be an expert in everything data related, but should be aware of what is out there and then go ahead and learn what they find interesting/useful. E.g., I quickly realized that I need to be a better Bayesian (hence the name of this blog). For this reason I started brushing up on my statistics and computational skills. I did this by taking Coursera courses and dabbling in Kaggle competitions, as described below. The photo above is my new notebook, which is full of notes I’ve taken from various statistics and machine learning courses. To keep a record of new syntax from the various new languages I have been learning, I find electronic notebooks like Evernote useful. I also find IPython Notebook useful for keeping notes on the syntax of Python and Julia.

(3) Convert your Academic CV to a professional resume. Like anything else you have written, there is an art to it, only now you have to be more concise and focused on how a company/organization would benefit from hiring you. Your resume should be short – a single two-sided page, as opposed to a long list of all your contributions at various conferences worldwide. When you build your resume, look at others’ to get a feel for what potential employers are looking for. For example, you can find my first resume here.

(4) Programming Languages: A few languages that are highly sought after these days are Python, R and SQL. The great thing about these high-level languages is that they are open source and have large communities that support them. I am very glad that when I started my Postdoc, I decided to ditch the astronomer’s curse of IDL (if you’re not an astronomer, don’t ask …) in favor of the much more fun Python. R is very popular amongst statisticians, but I have recently been pointed to Python packages such as pandas, scikit-learn and bokeh, which enable you to do statistics, machine learning and plotting without leaving Python. SQL is basic for database manipulation and data extraction, so it is also important to put on your to-do list, and eventually your resume. One more word of advice: if you are still stuck with editors like Emacs and vi, you might want to take a look at more user-friendly ones like TextWrangler and Sublime.

(5) MOOCs – Massive Open Online Courses. Even though we live in an age where airline companies still treat economy flyers as cattle and country borders still exist, I am glad to live in an era where one can learn from top Universities FOR FREE. Various online sites (Udacity, edX, Khan Academy) specialize in connecting hungry minds with Professors who teach anything from Music and History to Astronomy and Neural Networks. I personally have a Coursera account and have completed five courses in which I did all the homework assignments and weekly quizzes. One of them even had a midterm and a final exam, which I prepared for like an undergrad. This is the place to admit that when I had my first non-academic interview in 14 years, which happened to be with Google, I felt great that I understood my interviewer’s questions thanks to the various Machine Learning and Statistics courses I’ve taken. As Coursera claims, the level of the courses is not such that one can go out and do research in the field, but rather such that one can immediately implement what one has learned. For example, the now classic Machine Learning course by Andrew Ng (a Coursera co-founder) provides not only the theory behind the state of the art of the field, but also actual MATLAB code that can be applied. I am currently auditing, for free, the Probabilistic Graphical Models course given by another of Coursera’s co-founders, Daphne Koller.

OK, so you just learned (or are preparing to learn) a whole lot of statistics, and possibly a programming language or two. In a job interview you will need to show experience. In your academic research you probably worked on long projects, but for industry you’ll need to show that you can be effective in short-term projects. That’s where the following two entries might be useful.

(6) Online competitions: Kaggle serves as a bridge between organizations with questions about data they provide and data scientists who are eager to try out new tricks. The competitions run anywhere from a few weeks to over a year, with prizes that could be cash, kudos or even a job. It hosts competitions for companies such as high-tech firms, car insurance agencies and airlines, as well as for academics such as the Large Hadron Collider and astronomers. It is newbie friendly, with forums that are open for discussing methods. Kaggle also has tutorial competitions, e.g., the Titanic competition, in which you can practice techniques on actual passenger data to predict who survived.

(7) Seminars, Workshops, Bootcamps, Retreats. Various recruitment companies are starting to ride the exodus wave to help place scientists (and some engineers, too) in companies that are looking to hire data scientists. The first that I am aware of is Insight Data Science, which has sessions in the San Francisco Bay Area and in New York City. I failed in my first attempt to get accepted, for the January 2014 class (I think that I botched part of my interview because I did not prepare a working sample of my research), and got accepted to the NYC August 2014 session. I learned that the acceptance rate from the first to the second went from 1/15 to 1/20, and that they have a 100% job placement rate. I did, however, decline their generous offer (a $5,000 fellowship) in order to participate in a similar program in London called Science to Data Science. Their formats are slightly different, and will be the focus of a future blog entry. Other workshops that I am aware of: The Data Incubator (NYC), ASI (London) and the Data Science Retreat (Berlin). Note that because of the high demand these are very competitive to get into, hence one should put a lot of effort into the applications and the preparation for the interviews, just like for any job.

Please keep in mind that there are probably many other avenues to assist a transition to data science. These are merely the steps that I looked into. Throughout this blog in August and September 2014 I will share my experiences in the Science to Data Science program, as well as in looking for a job in London.