So, the inevitable backlash against 'Big Data' has begun. In an article in the Financial Times yesterday, journalist Tim Harford suggests that Big Data analyses are more dangerous than we thought:

http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz2xS1VXiUc

He points to reports that Google Flu Trends, Google’s system for predicting flu outbreaks from search queries, has become less reliable in recent years. For Harford, big data analyses, such as Google Flu Trends, are more fragile than traditional analyses because the models they produce are less interpretable:

'But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.'

I can understand why talk of 'big data' would lead someone to think that a mysterious and subtle model lay behind Google's analysis. Crunching through 50 million search queries must have led to an unbelievably complex model that no one could ever understand, right?

In fact 'Google Flu Trends' deliberately used an extremely simple model: a linear model with a mere 45 search query terms as inputs. Table 1 from the original paper (behind a paywall but attached here) summarises the queries that were used:

http://www.nature.com/nature/journal/v457/n7232/fig_tab/nature07634_T1.html#figure-title
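
To make concrete just how simple that is, here is a minimal sketch in Python (with invented query names and numbers, not Google's actual data or pipeline) of a linear model of the same shape: weekly search-query frequencies in, an estimated ILI rate out.

# Toy sketch of a Google-Flu-Trends-style model: an ordinary linear
# regression from weekly query frequencies to an influenza-like-illness
# (ILI) rate. Query names and numbers are invented for illustration;
# the real system used 45 query terms.
import numpy as np
from sklearn.linear_model import LinearRegression

queries = ['cold/flu remedy', 'flu symptoms', 'fever', 'cough medicine']

# Rows = weeks, columns = relative frequency of each query that week.
X = np.array([
    [0.8, 1.2, 0.9, 0.7],
    [1.5, 2.1, 1.8, 1.3],
    [2.4, 3.0, 2.6, 2.2],
    [1.1, 1.4, 1.2, 0.9],
])
# Observed ILI rate (e.g. % of doctor visits for flu-like illness) per week.
y = np.array([1.0, 2.2, 3.5, 1.4])

model = LinearRegression().fit(X, y)

# The entire fitted model is one readable weight per query plus an intercept.
for query, weight in zip(queries, model.coef_):
    print(f'{query}: {weight:.2f}')
print(f'intercept: {model.intercept_:.2f}')
print('predicted ILI for a new week:', model.predict([[1.0, 1.5, 1.1, 0.8]]))

Every input's contribution is a single coefficient you can read off, which is why it is easy to reason about what lies behind the correlations.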

Does Harford really believe that ‘we have no idea what is behind a correlation' between googling for things like 'Cold/flu remedy' and people actually having flu? I'm not a doctor but I am going to try and speculate about what's behind that correlation: people with flu want to stop having the flu.

The fact that, five years on, changing user behaviour has caused the algorithm to become less accurate is not unexpected. Indeed, the authors of the paper warn about this in their penultimate paragraph:

'Despite strong historical correlations, our system remains susceptible to false alerts caused by a sudden increase in ILI-related queries. An unusual event, such as a drug recall for a popular cold or flu remedy, could cause such a false alert.'

I suggest that ‘Big Data’ analyses are no more prone to this kind of problem than any other kind of analysis. 
4 comments
 
There is something even more dangerous... Those algorithms are usually used to discriminate between people. Someone gets a better deal than someone else. Yet it's not clear how the outcome is generated, and even if it is clear, it's not proven to be non-discriminating. You can run a bank that prefers white 25-year-old males. You are not allowed to discriminate during your hiring process, yet you are allowed to analyse the applicants. You won't say "I'm not hiring black people" but "My hiring cloud app told me that this candidate is not a good fit for my company". Yet all the algorithm did was find out that the person is most likely black and that you don't want to hire black people. Even worse: you build your own filter bubble. But the bubble is now at the center of our economy...
 
+René Treffer There are regulations that prohibit this. Regarding banking specifically, Reg B prohibits the use of age or gender in modeling.

Artificial intelligence (machine learning algorithms in this case) is a complement to the human decision-making process, not a substitute. Until AI has advanced to the point of human intelligence, you would be extremely naive to trust all decision making to an automated process. In production applications, random "noise" is baked in and analyzed against the test population to gauge performance.
 
+Sam Sachedina Yes, it prohibits taking age, race etc. as input. But those factors correlate with many variables in life. A simple example: you are less likely to get credit if more of your money comes from the state. This is a common banking algorithm that was revealed after an investigation into why Austrian officials couldn't get credit. It turns out their salary and welfare payments had the same source. Take the same algorithm to Germany and it will discriminate against every mother (gender + age). It will also discriminate by age in all countries. Those are algorithms running right now, deciding whether you are trustworthy enough to receive credit.
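
(A minimal sketch in Python, with entirely invented numbers, of the mechanism described here: the protected attribute is never given to the model, but a correlated proxy such as income source carries it back into the scores.)

# Toy sketch (invented data): a credit-scoring model trained without the
# protected attribute 'group', where 'income_source' correlates with it,
# so the scores still differ between groups.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
group = rng.integers(0, 2, n)                  # protected attribute, never a feature
# income_source: 0 = payroll, 1 = state transfer; correlated with group.
income_source = (rng.random(n) < 0.2 + 0.6 * group).astype(int)
income = rng.normal(3000 - 500 * income_source, 400, n)
repaid = (rng.random(n) < 0.9 - 0.3 * income_source).astype(int)

X = np.column_stack([income, income_source])   # 'group' is deliberately excluded
model = LogisticRegression().fit(X, repaid)

scores = model.predict_proba(X)[:, 1]
print('mean score, group 0:', scores[group == 0].mean())
print('mean score, group 1:', scores[group == 1].mean())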