This week we sat down with +Rachel Schutt, a Statistician in Google’s New York research group, also currently teaching a course in Data Science at Columbia University. We were interested to hear her views on the emerging field of Data Science, the complexities of gaining insight from massive data sets, and what motivates her to educate the next generation of problem solvers:
Research at Google:
How long have you worked at Google and, as a Statistician, what made you want to join?
Rachel Schutt:
I joined Google in 2009 after completing my PhD in Statistics at Columbia. I interned here in 2008 and worked on large-scale graph computations; the framework I worked with was called Pregel and now there are open source versions. I enjoyed it so much I knew I wanted to come here.
I have Master’s degrees in both Engineering and Mathematics, but I switched gears and got my PhD in Statistics. My thesis research was on modeling the spread of epidemics on networks. The networks could be thought of as offline networks of connected human beings or online social networks. The epidemic could be the spread of a disease or a video going viral, for example, which are mathematically equivalent.
I was excited to hear that Google hired Statisticians. Understanding the real world and solving real-world problems through data is important to me, and I was interested in working with the large amounts of data that we contend with at Google. I also wanted to work with people like Diane Lambert (http://goo.gl/8tJUu) and Daryl Pregibon (http://goo.gl/BPi3J), who are masters in the field. When Google announced we were building Google+ I knew I wanted to be part of that team because of the network structure of the data, so about a year into my time here I began working with the Google+ team.
R@G:
In this month’s edition of Harvard Business Review, Data Scientist is labeled the “Sexiest Job of the 21st Century” (http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/). We’re seeing the term a lot in the press lately. So we have to ask: what is Data Science?
RS:
The space of Data Science is large, and there are many opinions on what a Data Scientist is. My experience both working at Google and now teaching at Columbia has heavily informed my understanding of the scope of Data Science. As a field, it has many dimensions: machine learning, statistics, visualization, communication, computer programming, engineering, and application areas across disciplines including but not limited to biology, sociology, finance, health, urban studies and new media.
I think it’s interesting, this phenomenon with the labeling and marketing around the term “Data Science”. There are lots of people at Google you could call Data Scientists, although they don’t go by that term. They work across Google products; for example, Correlate (flu trends), Ngrams for books, search and ads quality, YouTube, and so on. They’re doing Data Science but their job titles are research scientists, statisticians, quantitative analysts, software engineers, user experience researchers. +Hal Varian, Google’s Chief Economist, declared Statistician the sexiest job of the decade (http://goo.gl/jPgMz) a few years ago, and now Data Scientist is the “Sexiest Job of the 21st Century”. That’s a step up -- from decade to century!
R@G:
Could you talk some more about what you work on here, and how you collaborate with other groups and disciplines across Google?
RS:
An exciting part about being at Google has been the opportunity to be part of a strong inter-disciplinary Data Science team-- the Google+ Data Science team-- which has software engineers, physicists, social scientists, computer scientists and statisticians. If you’re able to put together a set of people with complementary skill-sets, you can create a lot of synergy and potential. You need to have people who are good at analyzing data, as well as those who are good at processing it and building up the infrastructure around it.
One should be honest about what one’s strengths and weaknesses are and embrace working with people who are stronger than you in some things-- that’s how you learn, and if you can get past your ego, a lot can be accomplished. My profile is strong in statistics, mathematics and machine learning. In written communication I’m strong, but I’m less strong verbally. Although I’m interested in social behavior, I’m not an expert social scientist. Visualization is more of a weak area, although I value it and do it. Computer programming is also a weaker area for me-- at least compared to the software engineers we have at Google --but I still do a lot of it. The people I work with made up for what I lacked and vice versa. Plus now I’m better at what I was weaker at before.
Specifically some examples of what I look at: What kinds of behavior correlate with people coming back to the product? Are users forming communities? How are they discovering content? How does a post travel through time and through the network? Often we’re interested in making causal statements, so we use experiments and causal modeling for that. We take privacy extremely seriously and work at an anonymized and aggregated level.
I also help design experiments. For experiments, we’ll change the underlying algorithm or UI, then see if that causes a better user experience. I prototype algorithms, and I find insights around the data to help decision-makers make informed decisions. When we find insights, we don’t just change strategy, we’ll build the changes back into the product. I like the end-to-end aspect of this work.
My approach is human + machine-- finding patterns and meaning in data and trying to understand human behavior. You can automate your way of thinking eventually. I believe in computing and the power of math and statistics-- but I’m also interested in the process of gaining intuition from data.
R@G:
You were recently quoted in CIO (http://goo.gl/N9Ejy), where you say that part of the motivation for your Columbia Data Science course is the widely held opinion that Data Science can’t be taught in a classroom. How did that course come about, and how has your work at Google informed its content?
RS:
Over the last couple of years, I started hearing the term “Data Science” a lot here in New York and in Silicon Valley, and decided I wanted to know more about what people meant by it. So I started meeting people in the tech community who called themselves Data Scientists. I also met with the Chair of the Statistics Department at Columbia, David Madigan, to find out what was being done in universities around Data Science. The course (http://columbiadatascience.com/) came out of those conversations.
Being at Google makes me uniquely positioned to understand what’s going on, and I appreciate what I’m exposed to here. It forces me to understand what needs to be taught-- I could not have done this course otherwise. I also engaged the NYC tech community as guest lecturers to give the students a comprehensive view of Data Science.
My understanding of Data Science has been evolving throughout the semester. Before the semester began, I thought the students would primarily have computer science, statistics and engineering backgrounds, but it turns out I also have sociologists, bioinformaticians, urban planners-- it’s a real mix. And it makes sense, because they’re all grappling with massive data, new kinds of data, and computational challenges. It’s easier to see what Data Science is when you have this kind of broad sample. I think the new Institute for Data Sciences and Engineering at Columbia will embrace this diversity.
R@G:
So because Data Science is relatively new as an academic subject area, how do you ensure your students embrace the ambiguity?
RS:
I am trying to take a holistic approach, both in terms of the subject matter we cover and in terms of the perspectives they’re exposed to. The first day of the course was standing-room-only, and I admit I was struggling with defining the scope of the course-- a set of practical skills versus a research discipline --but the students have really embraced it. They are deep thinkers; they’re not only trying to learn a set of skills. They’re thinking about science, about how data can solve problems and be pulled into products and policies that shape people’s lives, as well as the ethical issues associated with data use.
Using data responsibly is important to companies like Google, and it’s therefore important for us to educate on the same. Ethics is actually a big theme of this class. They’re doing highly technical work: coding, building algorithms, scraping the web, cleaning up messy data, classifying documents from the New York Times for example, but they’re also thinking about social value, understanding human behavior, and how to communicate their intuition and findings.
One of the companies I’ve involved is Kaggle (http://www.kaggle.com/). I respect Kaggle both as a company and as a pedagogical tool, so I partnered with them to run an essay-scoring competition in my class. Kaggle provided the dataset and infrastructure; my students are looking at how humans rate the essays, and writing a machine learning algorithm that can grade essays automatically. We get to talk about how good machines are, but also the ethics around predictive modeling.
So to answer your question, it’s less ambiguous to me now. We need a new generation of problem solvers and scientists who know how to handle and find meaning in massive data sets, and do so ethically and with integrity. And I’m proud to be educating them.
R@G:
What important problem would you like to see solved with data?
RS:
I’d like my beautiful, smart, non-verbal, (severely) autistic brother, Alex, to be taken seriously as the intelligent human being he is and to be given the opportunity to speak. One day, Data Scientists could solve this. Generally I want to see data being used to solve problems of social value.
R@G:
What advice would you offer to PhD students who’d like to pursue a career in Data Science?
RS:
Know what your own strengths and weaknesses are. Learn the practical things: take computer science classes; learn how to program; do machine learning; but also, learn how to communicate your ideas clearly, and figure out how to work with people across disciplines. I think data is a language, and if you can use technical language in human terms you can solve very important problems.