 
The era of "Big Data" has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and many others are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential benefits and costs of analyzing information from Twitter, Google, Verizon, 23andMe, Facebook, Wikipedia, and every space where large groups of people leave digital traces and deposit data. Significant questions emerge. Will large-scale analysis of DNA help cure diseases? Or will it usher in a new wave of medical inequality? Will data analytics help make people’s access to information more efficient and effective? Or will it be used to track protesters in the streets of major cities? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what ‘research’ means? Some or all of the above?

+Kate Crawford and I decided to sit down and interrogate some of the assumptions and biases embedded in the rhetoric surrounding "Big Data." The resulting piece - "Six Provocations for Big Data" (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431) - offers a multidisciplinary social analysis of the phenomenon with the goal of sparking a conversation. We intend to present this paper as a keynote address at the Oxford Internet Institute's 10th Anniversary "A Decade in Internet Time" Symposium.

Feedback is more than welcome!
30 comments
 
And all the good research is behind a paywall. <sigh...>
 
Big data analytics shifts the archetype from a data warehouse to a microscope held up to a firehose.
 
Paying the piper. Those internet pipes aren't free. Speech might be free, but beer and the internets are never without a cost.

This is always complex, and it connects to another 'other side of the coin' discussion: the idea that too much media fosters not so much free speech as the ability to find support for your own viewpoints on the internet, which is leading to more rigidly opinionated people rather than tolerance and diversity.

In other words, look beyond the hype. Thank you for turning your lens on our quantified lives.
 
Agree that things can be ... complicated. But that doesn't mean we should stop trying to leverage Big Data? Not sure what the implications of the provocations would be... If your readers would react with "Right... So?", what would you say?

The attention trust principles were a good start (http://p2pfoundation.net/Attention_Trust#Characteristics) for practical advice.

We rely on them in our work on 'learning analytics': basically Big Data approaches to support people - not machines - in their learning efforts...
 
Dana, this is timely for me as I am involved in working with a large aggregation of public data here in Virginia. Thanks for the "provocations."
 
Interesting that you say mathematicians are clamoring: I feel like lots of mathematicians should be clamoring, but most are not. I agree with the quote from Chris Anderson, "This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear." But a lot of mathematicians aren't seeing this connection, in my view, and are happy with their chalk.
 
Big data has been around for many years... Now it is getting attention!
 
I can hardly wait for some political faction to campaign against Big Data. Stock jobbers trading datum futures....
 
It's notable that some journals (particularly in economics) require that any data used in research papers be published and made available to others. See The American Economic Review, for example: http://www.aeaweb.org/aer/data.php. That's obviously problematic for people who want to use data to study privacy behaviors or medical treatments, but that's only the tip of the iceberg. Most of the data related to human behavior should be regarded as private, so it's going to be a permanent problem in social science research. Privacy researchers have repeatedly shown that it's almost impossible to suitably scrub data to remove privacy sensitivities.
 
+Dan Lovejoy If you click on the link ONE CLICK DOWNLOAD you will get the paper and not just the abstract.
 
Thanks Tamale! (If that is your REAL name ;-) I guess I'm so used to seeing that paywall abstract page, I don't even look for the download link. How refreshing.
 
+Dan Lovejoy I thought the same thing, but it didn't make sense to ask for input from an abstract, so after dorking around I finally noticed the link. Whew, I was worried you would ask about my last name, Dan Lovejoy... if Dan is your real name, too. :-)
 
Ahh - sorry for the trouble. My last name is "Lovejoy." It's almost unbelievable, isn't it?
 
When I was in grad school I used to sit next to the demographers at computer terminals analyzing data. We were always jealous of the size of their data sets. This was in the late '70s; those data sets are nothing compared to today's big data.
 
What about natural sciences? We make data too!
 
I've found the tool from Discovertext.com intriguing in this regard. I'm able to tap into the social networking fire hose and grab stuff with it. I suspect we will see more and more tools like this emerging. The questions you ask are important and I have no answers. As I continue to try and use qualitative methods on these growing data, I find myself falling into the quantitative trap.
 
I think that a big problem is that all this data is heavily skewed towards what Henrich et al. describe as W.E.I.R.D. people: Western, Educated, Industrialized, Rich and Democratic (http://www2.psych.ubc.ca/~henrich/pdfs/Weird_People_BBS_final02.pdf).

Even if lower-income people are getting access to FB, Twitter, etc. via, say, public libraries, the majority of people posting to such services will be people with laptops and smartphones... a very small section of a very large population. There's still a big need for one-on-one interviews and lab studies, but everyone is turning to Tweets and Turkers because it's cheaper.
 
Physical sciences have a long history of "working with large data". Examples include weather and climate modeling, shock hydrodynamics, oil exploration, space exploration, etc. The stock market has switched from being human-driven to being mostly algorithmic trading based on data. The social sciences are just coming into their own in using large data sets for advertising, sentiment analysis, search, etc. It feels more "new fangled" for these disciplines.
 
Does the definition of "big data" imply what that data is about? It seems like you are very focused on social networking related data sources. Is this because the definition of "big data" is restricted to this area? Or is this just one area of big data that you choose to focus on?

This is not a criticism, I'm honestly trying to figure out whether I've been using the term big data incorrectly.

If "big data" can apply to a broader set of domains, I would suggest that you frame your context in the introduction OR expand your analysis to the implications of working with large data sets in other fields/situations (corporate, scientific, government, etc).
 
I think "big data" is much MUCH broader than social networking. Big data is driven by Kryder's law, which says the cost of storage is falling at an exponential rate. That allows all sorts of organizations to record and mine data, including the Walmarts of the world, governments of the world, internet companies, airlines, FedEx, and Nissan. They will record this data if they think they can derive value from mining it.
 
Most basic phones in Japan now have Facebook, and there are more phones than people in Japan, whose people are definitely not Western.
 
I wonder how Big Data will be able to handle soft information that appears to derive from hard data, e.g., demographic changes in ethnically diverse populations. This and other factors can vastly change behavior patterns, but without looking at the “why” factor, good insights will be lost by just looking at Big Data. In understanding many subcultures, understanding their behavior is all about understanding the “why.” Population demographics have been rapidly shifting, and will continue to do so. Population demographics and behaviors are dynamic, whereas datasets are static.

While data is cleaned and statistical calculations are applied to supposedly address extreme data in a population, it is often these extremes, the outliers, that give us insights into 'the why.' To ignore consumer or user behavior is to ignore the human nature that is within us.

Your discourse brings up some excellent points, and the ethical aspect is certainly not the least of them. Data has been obtained and derived without informed consent on a regular basis in the US, so this is undoubtedly an area of concern. Thanks for the share!
 
Thank you for the initiative. I too have been thinking about this for a while, and now I'm glad that someone at that level is putting it on the table.
 
A sociology professor on my campus sent me these comments...

This is a really interesting article. It goes along with what I see happening, and that is a methodological breakthrough for the social sciences that is AS big as the adoption of survey methods and random sampling methods. In particular, as we gain the ability to access and analyze current and real-time data from search engines, we can begin to look like social meteorologists when it comes to describing public sentiment. However, I don’t see Big Data as a holy grail: it will not answer the really big theoretical questions. It will provide more ways of examining certain theories, and will give us better prediction models, but these are weak without a theoretical specification of causal mechanisms. One of my biggest concerns is that we will become obsessed with public opinion and present fads, which are highly shaped and influenced by the media. It does not offer as much in terms of studying more obscure topics that people are not tweeting about, but that are nonetheless important.
 
Thank you so much for putting this together and even more thanks for sharing it. It is a fantastic read!

Despite our own intuition about the relative dishonesty of people, we still retain an illogical level of faith in data produced by them. As you mention, when these people have agendas (as they often do), they can be motivated to mold the data into an unwitting participant in their deception (deliberately or just through interpretation, as you discuss). With big data the statistical power is so high that finding significant results is virtually guaranteed.

Now of course, we should perform tests in order to demonstrate effect size and confidence intervals. But when the pressure is on to win a contract, convince the board, or make the case to the shareholders, who has the insight, willingness, and power to demand this level of accountability? Do we check, or do we just assume others did?
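
To make the point about statistical power concrete, here is a minimal Python sketch. It assumes a two-sample z-test on two equal-size groups with a known standard deviation; the function name `two_sample_z` and all of the numbers are hypothetical and chosen only for illustration. The same negligible effect (a difference of 0.01 standard deviations) is nowhere near significant with 1,000 people per group, yet becomes overwhelmingly "significant" with 10 million per group, even though the effect size has not changed.

```python
import math


def two_sample_z(mean_diff, sd, n_per_group):
    """Two-sample z-test for a difference in means (equal group sizes, known sd).

    Returns (z statistic, two-sided p-value, approximate 95% confidence interval).
    """
    se = sd * math.sqrt(2.0 / n_per_group)          # standard error of the difference
    z = mean_diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))          # two-sided p-value under the normal
    ci = (mean_diff - 1.96 * se, mean_diff + 1.96 * se)
    return z, p, ci


# Hypothetical numbers: the same tiny effect, two very different sample sizes.
MEAN_DIFF = 0.01   # observed difference between the groups
SD = 1.0           # assumed common standard deviation
effect_size = MEAN_DIFF / SD   # Cohen's d, identical in both cases

z_small, p_small, ci_small = two_sample_z(MEAN_DIFF, SD, 1_000)
z_large, p_large, ci_large = two_sample_z(MEAN_DIFF, SD, 10_000_000)

for label, n, p, ci in (
    ("small sample", 1_000, p_small, ci_small),
    ("big data", 10_000_000, p_large, ci_large),
):
    print(f"{label}: n={n:,} d={effect_size} p={p:.3g} "
          f"CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

Reporting the effect size and the confidence interval next to the p-value is what lets a reader see that the "significant" result in the large sample is still a very small difference.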
 
Thanks for the great share.