Matthew Brett
Matthew's posts

"At a recent workshop on mixed-effects models, a prominent psycholinguist [G. T. M. Altmann] memorably quipped that encouraging psycholinguists to use linear mixed-effects models was like giving shotguns to toddlers".

Barr et al. (2013) "Random effects structure for confirmatory hypothesis testing: Keep it maximal" Journal of Memory and Language 68, 255–278.

Bundle it up and send it off

[Working in a statistical working group during the second world war] was a very important move for me, partly because I got to work with Jimmie Savage, and partly because I learned to work with problems - real problems - quickly. Problems came in and they always had a deadline on them, and I learned that you had to do what you could, and then write it up and send it off because it wasn't going to be of any value later. That was a marvelous thing for me. Until then most of my statistical research wasn't quite finished or quite good enough to display. After that I learned to bundle it up and send it off.

Francis J. Anscombe (1988) "Frederick Mosteller and John W. Tukey: A Conversation" Statist. Sci. 3(1), 136-144.

The principle of 'bundle it up and send it off' is also a fundamental feature of agile programming. I think this is another piece of evidence that we should apply agile principles to the process of research.

Big data and (not) knowledge

"In any discussion of massive data and inference, it is essential to be aware that it is quite possible to turn data into something resembling knowledge when actually it is not. Moreover, it can be quite difficult to know that this has happened."

2013 National Academies report "Frontiers in Massive Data Analysis". The National Academies Press, Washington, D.C.

I believe this understates the problem. Misinterpreting huge analyses is not only especially easy, it is especially tempting. "Garbage in, gospel out" describes the situation in which the data and the analysis have become so complicated that it is no longer possible to reason about the output. In order not to look foolish or confused, the consumer of the output has two options. One is to spend a very long time working through the data and the fundamentals of the analysis in order to work out whether there could have been a false assumption or an incorrect analysis step. The other is to assume that the conclusions are correct.

What is data science?

An excellent summary by Zoubin Ghahramani in a 2015 panel discussion at the Royal Statistical Society. Below is a YouTube link starting at his summary at 25:56; he finishes speaking at 31:28.

"... academic statistics may have lost its way."

See my previous post on the statistician Leo Breiman and his background in consulting for industry:

Breiman talked about data analysis in industry and academia in an interview with his colleague Richard Olshen:

"Olshen: ... What advice would you give to a young person today who wants to continue in your traditions? What should he or she study and why?

Breiman: Well Richard, I’m torn in a way because what I might even tell them is, “Don’t go into statistics.” My feeling is, to some extent, that academic statistics may have lost its way. When I came, after consulting, back to the Berkeley Department, I felt like I was almost entering Alice in Wonderland. That is, I knew what was going on out in industry and government in terms of uses of statistics, but what was going on in academic research seemed light years away. It was proceeding as though it were some branch of abstract mathematics. One of our senior faculty members said a while back, “We have to keep alive the spirit of Wald.” But before the good old days of Wald and the divorce of statistics from data, there were the good old days of Fisher, who believed that statistics existed for the purposes of prediction and explanation and working with data.

Before you came this morning, I pulled out Webster’s dictionary and looked for the definition of statistics, and here is how it goes: “Statistics, facts or data of the numerical kind assembled, classified, and tabulated so as to present significant information about a given subject.” When used with a singular verb, it is, quote, “The science of the assembling, classifying, tabulating, and analyzing such facts or data.”

Now, little of that is going on in the academic world of statistics. For instance, I was looking at The Annals of Statistics and I estimate that maybe 1 paper in 20 had any data in it or applied the methods there to any kind of data. The ratio is not much higher in the Journal of the American Statistical Association. So my view of what’s fascinating in the subject of statistics and the common academic view are pretty far apart.

In the past five or six years, I’ve become close to the people in the machine learning and neural nets areas because they are doing important applied work on big, tough prediction problems. They’re data oriented and what they are doing corresponds exactly to Webster’s definition of statistics, but almost none of them are statisticians by training.

So I think if I were advising a young person today, I would have some reservations about advising him or her to go into statistics, but probably, in the end, I would say, “Take statistics, but remember that the great adventure of statistics is in gathering and using data to solve interesting and important real world problems.”

Right at the end of the interview:

"You know, sometimes I feel sad about statistics. There are so many smart people in it and I hope it gets better before it gets worse."

Richard Olshen (2001) "A conversation with Leo Breiman" Statistical Science 16(2), 184–198.

Industry as the origin of "data science".

Leo Breiman was a statistician who was interested in algorithmic models. He invented, among other things, the random forest method. Breiman did his first degree in Physics, a PhD in mathematics, and then worked for seven years as an academic probabilist. He resigned his academic position and worked for a further 13 years as a freelance consultant in data analysis, before returning to academia as a statistician in UC Berkeley. In his 2001 paper "Statistical Modeling: The Two Cultures", he describes what he learned from his time working as a consultant:

As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems:

(a) Focus on finding a good solution - that's what consultants get paid for.
(b) Live with the data before you plunge into modeling.
(c) Search for a model that gives a good solution, either algorithmic or data.
(d) Predictive accuracy on test sets is the criterion for how good the model is.
(e) Computers are an indispensable partner.

Leo Breiman (2001) "Statistical Modeling: The two cultures" Statistical Science 16(3), 199–231.

It seems to me this is a manifesto for what is currently being called "data science". I wonder whether this is one case where industry has injected urgency and rigor into the process of analysis.
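Breiman's point (d) can be illustrated with a toy sketch. Everything here is invented for illustration: a synthetic threshold-plus-noise dataset, a deliberately simple one-parameter "model", and an arbitrary 150/50 split. The point is only the workflow: fit on one subset, judge the model by its accuracy on the held-out subset.

```python
import random

random.seed(0)

# Synthetic data: the label follows a simple threshold rule, with 10% noise.
xs = [random.random() for _ in range(200)]
data = [(x, int(x > 0.5) if random.random() > 0.1 else int(x <= 0.5))
        for x in xs]

random.shuffle(data)
train, test = data[:150], data[150:]

# "Model": choose the cut point that best separates the training labels.
def fit_threshold(rows):
    def accuracy(t, rows):
        return sum((x > t) == bool(y) for x, y in rows) / len(rows)
    return max((x for x, _ in rows), key=lambda t: accuracy(t, rows))

threshold = fit_threshold(train)

# Point (d): report accuracy on the held-out test set, not the training set.
test_accuracy = sum((x > threshold) == bool(y) for x, y in test) / len(test)
```

Training accuracy alone would flatter the model, because the threshold was chosen to fit those very points; the test set is the honest criterion.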

Tukey (1965) on the importance of computing in the education of statisticians:

Most of the technical tools of the future statistician will bear the stamp of computer manufacture, and will be used in a computer. We will be remiss in our duty to our students if we do not see that they learn to use the computer more easily, flexibly, and thoroughly than we liver [sic] have; we will be remiss in our duties to ourselves if we do not try to improve and broaden our own uses.

In 1965, Tukey predicted the rise of computer science at the expense of statistics:

Today, software and hardware together provide far more powerful factories than most Statisticians realize, factories that many of today's most able young people find exciting and worth learning about on their own. Their interest can help us greatly, if statistics starts to make much more nearly adequate use of the computer. However, if we fail to expand our uses, their interest in computers can cost us many of our best recruits, and set us back many years.

This figure is a diagram of a system to analyze phone call data from Chambers (1998) "Computing with data: Concepts and challenges":

Note the caption: this horrible mix of programming languages is the "wish list". I think it would now be reasonable to replace Java / C++ / Perl / statistical language with "Python with suitable libraries".

To be a mature data analyst, you must also be a programmer:

One additional general point needs emphasis: in modern computing, there should not be a sharp distinction between users and programmers. Most programming with statistical systems is done by users, and should be. As soon as the system doesn’t do quite what the user wants, the choice is to give up or to become a programmer, by modifying what the system currently does.
Such user/programmers then naturally go through stages of involvement. In the first stage, the user needs to get that initial programming request across to the system, quickly and easily. Later, the user needs the ability to refine that first request gradually, to come closer to what was really wanted. Good software for computing with data should support all such stages smoothly.

John Chambers (1998) "Computing with Data: Concepts and Challenges".
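Chambers's stages of involvement might look like the following sketch. The function names and the trimming refinement are hypothetical, not from the paper: the user first asks for a quick summary, then, as a user/programmer, refines that request toward what was really wanted.

```python
# Stage 1: the initial request — the user just wants a summary of a variable.
def summarize(values):
    return {"mean": sum(values) / len(values)}

# Stage 2: the user, now a user/programmer, refines the request: trim
# extreme values before averaging, controlled by a new parameter.
def summarize_trimmed(values, trim=0.0):
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values dropped from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return {"mean": sum(kept) / len(kept), "n_used": len(kept)}

data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(summarize(data)["mean"])              # 22.0 — dominated by the outlier
print(summarize_trimmed(data, trim=0.2))    # drops one value from each end
```

Good software for computing with data, in Chambers's sense, is software that makes the step from `summarize` to `summarize_trimmed` feel like a small, natural refinement rather than a change of tools.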

Later in the same paper:

It may seem obvious, perhaps even trivial, to assert then that languages, and in particular good languages, are the heart of successful computing with data. I believe the statement to be true, but it is not easy to defend explicitly, and much activity in computing with data (or equally in other kinds of computing) goes on as if it were not true.