How a $500K Investment Netted $3 Billion in One Year

In this blog, I am introducing you to Kaggle, a company that incentivizes the best data scientists in the world to examine your data and help solve your problems.

I first learned about "Data Mining Competitions" from my good friend and +X PRIZE Foundation Trustee Rob McEwen. Unlike me, Rob owns gold mines (yup, gold mines) -- a lot of them. And in 1998 he wanted to understand how much gold was in one of his particular mines, but his top scientists looking at the geological data couldn't tell him. So, he took all of his secret data (normally kept in the safe) and put it up on the web for the world to see. 

Next he put up a $500,000 prize and asked data scientists worldwide to analyze this data and show him where he could find his next 6 million ounces of gold. The data scientists took the bait and the competition was on. Rob had 1,400 people download the data and 125 entries. As it turns out, the top three winners (none of whom, by the way, ever physically traveled to visit his mine) showed him where to locate those 6 million ounces of gold. A $500,000 purse netted him some $3 billion in value in just one year. Now THAT was leverage!

The questions I put to you are these:

Do you have a lot of data and a problem that you'd like to challenge 45,000 data scientists to help you solve?

Do you want to find a way to identify whether a song will be a hit? 

Do you want to determine whether photos submitted to your website are any good? 

Do you want to advance research in HIV treatment? 

Do you want to discover where the universe is hiding its mysterious dark matter? 

A machine-learning, data-competition platform called Kaggle can help solve these problems. In fact, Kaggle has actually solved these challenges. 

To get all the details, I interviewed Kaggle's President and Chief Scientist Jeremy Howard. Jeremy is a brilliant data scientist himself. He's an entrepreneur with a thick Australian accent and a background in philosophy and management consulting who'd built and sold startups. His first affiliation with Kaggle was when he competed successfully in Kaggle's early contests. So successful was he in these competitions -- including one that was trying to replace the 50-year-old ranking system for chess matches -- and so enamored was he of the Kaggle platform that after running into founder and CEO Anthony Goldbloom, a fellow Aussie, Jeremy joined the team and moved to San Jose. 

I caught up with Jeremy at +Singularity University for this interview.

"So what is Kaggle?" I asked him. "Kaggle is a new kind of company, which is creating a whole new way of doing work by leveraging the most powerful tools out there -- machine learning and artificial intelligence," he said. "Kaggle has built a platform that allows you to get access to more than 45,000 data scientists to help you with your problems. Throw away your preconceived ideas and think about what ways you can potentially transform your business by leveraging machine learning. Kaggle is a marketplace. All the best marketplaces bring together two groups that are looking for each other. In Kaggle's case we're bringing together people with interesting problems to solve and lots of data to mine, and tens and tens of thousands of data scientists -- many the best in the world -- who enjoy a challenge, who want to look at your data and figure out what's hiding in there."

For those of you who have never heard the term "machine learning," Jeremy provided a quick description of this as well. "Machine learning is the ability for machines (computers) to come up with ways of solving problems themselves just by looking at some data you give them about some causes and effects," he explained. "In the long term it's my strong belief that development of machine learning will eventually lead to strong artificial intelligence."
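For readers who like to see an idea in code, here is a minimal sketch of what "looking at some data about causes and effects" can mean in practice. It is only an illustration, not anything Kaggle or Jeremy provided: the function names (`fit_line`, `predict`) and the tiny data set are hypothetical, and the technique shown is ordinary least-squares line fitting, one of the simplest forms of machine learning.

```python
# Illustrative sketch only: the names and data below are hypothetical.
# The machine is given pairs of (cause, effect) and works out the rule itself.

def fit_line(causes, effects):
    """Fit effect = slope * cause + intercept by ordinary least squares."""
    n = len(causes)
    mean_x = sum(causes) / n
    mean_y = sum(effects) / n
    # Spread of the causes, and how causes and effects vary together.
    sxx = sum((x - mean_x) ** 2 for x in causes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(causes, effects))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, cause):
    """Apply the learned rule to a cause the machine has not seen before."""
    slope, intercept = model
    return slope * cause + intercept


# Made-up training data: the hidden rule is effect = 2 * cause + 1.
causes = [1.0, 2.0, 3.0, 4.0]
effects = [3.0, 5.0, 7.0, 9.0]

model = fit_line(causes, effects)
print(predict(model, 5.0))  # 11.0
```

Nobody told the program the rule; it recovered it from the examples. Real competitions use far richer data and models, but the shape of the task is the same.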

"Most of the data scientists who compete are either with major universities or running research departments; these guys aren't available for you to easily hire." Jeremy went on to explain that many are scientists who work at companies like Google, Facebook, LinkedIn, Microsoft and Apple during the day. "On evenings and weekends they compete on Kaggle." To date there have been about 100 or so Kaggle competitions since its founding in 2010.

"The interesting thing is that very few of these people who are winning Kaggle competitions are first and foremost machine-learning researchers or experts in the particular field we are trying to solve," Jeremy says. "The teams competing on Kaggle are typically analyzing data to try and solve problems on subjects like glaciers or particle physics or electrical engineering." 

What's powerful is how these people can channel their unique expertise in one area into another. For example, one NASA-funded Kaggle competition involved the search for dark matter -- that elusive material whose existence has been suspected for decades but has never been found. 

"People have been looking for that stuff for some time," Jeremy said. "Now, there's really one thing we know about dark matter, which is that it has gravitational pull. And the one thing that we do know about gravitational pull is that it can actually bend light. So some pretty smart researchers realized a while ago that if we look at really distant objects in the sky, distant galaxies, we should be able to detect dark matter by seeing the light from those distant objects being bent, it skews the light from these galaxies. All you need to do to create a universal map of dark matter is to create a universal map of these galaxies, and figure out how much their light has been skewed. And that way you know where this dark matter is and you have a map of it. Now coming up with an algorithm to do this had been attempted for three decades, but nothing significant had been developed." 

To solve the dark matter mapping challenge, in the Kaggle competition +NASA put all of its galactic observation data online asking scientists to come up with an improved algorithm to map the dark matter. "Within three days of launching the competition, our competing teams basically smashed all past research efforts," said Jeremy. 

What was also remarkable was the source of early breakthrough algorithms. "It didn't come from an astronomer or astrophysicist, instead it came from a guy named Martin O'Leary, who studies the movements of glaciers at Cambridge University," Jeremy said. "He's a glaciologist who had developed algorithms to correct for atmospheric refraction and pixelation of glaciers' images taken from Earth orbiting satellites. And when he applied his learning to distorted galactic images, he saw a real improvement over previous algorithms." 

"By the end of three months, 15 teams had surpassed all previous NASA research, all using different approaches. In fact, some particle physicists ended up winning this competition. Their best result was over 300 percent more accurate than NASA's previous best algorithms. All 15 groups went to the +NASA Jet Propulsion Laboratory and worked together with NASA in actually implementing these dark-matter-mapping algorithms," Jeremy said. 

When I asked Jeremy what problems he thought Kaggle would be most useful for solving, especially for an entrepreneur, here's the list of his top 5 ideas: 

1. Helping an entrepreneur start a business: Kaggle can help you analyze the data of a particular industry to see where openings might exist for new products.

2. Sorting through visual data more quickly than humans. "We held a competition that allowed a company to develop an algorithm that would actually predict which user-generated pictures were more 'beautiful' than others."

3. Tapping into a variety of customizable data products. "One such is the news-aggregation site Prismatic," Jeremy said, "which curates news from all over the Web and then uses machine intelligence to predict which articles you're likely to enjoy and curates a kind of newspaper for you each day that it thinks you're going to like." 

4. Seeing where technology can drive innovation in certain industries, such as automotive. Kaggle worked with Ford, for example, "to identify a system for cars that would automatically identify if you were getting drowsy, or even not alert," Jeremy said. "They had lots of data about vehicle sensors and physiological data and so forth and they wanted a predictive model. They put that data up on Kaggle and within days there were people who were solving that problem. None of them had a background in vehicle safety, or vehicle sensor systems."

5. Identifying how and where machine learning can transform a new or existing business. This is where a new capability called "Kaggle Prospect" comes in. In Kaggle Prospect, "you can run a competition where people come up with ideas on what data competitions to run using your data. So you say, 'Here's our data, here's a snapshot of roughly how our business works,' and data scientists who actually understand machine learning come back to you with different ideas on what insights, solutions and improvements they can extract from your data."

After spending the afternoon with Jeremy, here are my top three takeaways regarding Kaggle competitions:

1. People compete for the challenge, not the money. Most of the data scientists who compete do so in their spare time; they're working at universities or running research departments. 

2. People compete to learn about their process. "If you do really well then you learn that this is an algorithm which is amongst the best in the world, if you don't do so well, you'll find out where your gaps are," Jeremy says. 

3. Leaderboards are a great way to spur competition. "I used to compete myself," Jeremy says, "and I found that there were many situations where I'd thought I'd done the best that I possibly could do, but getting passed on the leaderboard made me find things I didn't know were in me."
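To make the leaderboard idea concrete, here is a minimal sketch of how one could be computed. This is not Kaggle's actual scoring code: the `leaderboard` function, the team names and the scores are all hypothetical, and I am assuming that a higher score is better and that each team is ranked by its best submission.

```python
# Illustrative sketch only: hypothetical names and scores, not Kaggle's system.
# Assumption: higher score is better; each team is ranked by its best entry.

def leaderboard(submissions):
    """Return (team, best_score) pairs sorted from best to worst."""
    best = {}
    for team, score in submissions:
        # Keep only the strongest submission seen so far for each team.
        if team not in best or score > best[team]:
            best[team] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up submissions, in the order they arrived.
submissions = [
    ("glaciologists", 0.71),
    ("physicists", 0.68),
    ("glaciologists", 0.74),
    ("physicists", 0.79),  # this entry passes the earlier leader
]

for rank, (team, score) in enumerate(leaderboard(submissions), start=1):
    print(rank, team, score)
```

Every new submission can reshuffle the ranking, which is exactly the "getting passed" moment Jeremy describes.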

In my next blog I'm going to continue writing about Kaggle, but this time I'm going to show you how the platform helped a large insurance company realize billions of dollars in benefits, all for a $10,000 prize. 

NOTE: As always, I would love your help in co-creating BOLD, and will happily acknowledge you as a "contributing author" for your input. Please share with me (and the community) in the comments below what you specifically found most interesting, what you disagree with and any similar stories or examples that reinforce this blog that I might use as examples in writing BOLD. Thank you!
+Peter H. Diamandis Competitions between teams have proven effective, but an additional approach that might increase innovation is micro-competitions. Instead of only having separate teams all individually tackling a huge challenge, there could be an additional competition composed of micro-challenges.

The huge challenge could be broken up into smaller components with teams (or even individuals) just focused on that specific component. It may even prove valuable not to disclose to them what the larger challenge is, to prevent any limiting preconceptions they may have from diminishing their possibilities. Share the results of the top three (or five, or ten, etc.) contributors on each micro-challenge to make finding the solution more collaborative while still maintaining the competitive aspects that drive results.

At the same time, separately continue the original competition, and at the end of a pre-specified period, compare results. After this, release all findings and approaches to both the complete challenge and the individual components and begin a second competition (which would end only when a suitable solution was achieved) with this vast new trove of relevant information and innovation. 

This also would allow many more contributors to participate by not restricting those who do not have the time or other resources to compete in tackling the entire challenge. 
Peter, one of the things I love best about this is pulling (or pooling) intelligence from unlikely areas.  The connections one person might make in their own field can be applied to something entirely different in another field and all types of solutions can be found.  It is real out-of-the-box thinking because the other people are already out of the box (or in this case, field).  Inter-field idea sharing allows for immediate innovative thinking which can lead to further and rapid acceleration of progress and advancements in multiple streams of study, simultaneously.
I had an email conversation with Jeremy last Fall about the trust gaps in the shared economy which proved very helpful in framing our approach. As a leadership team, we made the decision to approach building a prototype with a small development team, anchored by a world class data & machine intelligence scientist who is now on our team. Given that our ultimate goals reach far beyond just the sharing economy, we will definitely be taking our initial efforts back to Kaggle with the goal of improvement by leaps as opposed to incremental.

Combining structured and unstructured data to consistently elicit a complex and deeply individualized emotional response is a challenge. Accomplishing this while still maintaining simplicity as a core functional design element really puts the challenge into the "interesting" classification. We believe the Kaggle platform can help us get this done.

The other "opportunities" I believe would benefit from the involvement of Kaggle are those where saving lives is a deliverable. For example, tremendous benefits will be realized from greater accuracy in predicting climate related events such as hurricanes and tornado formation, earthquake predictions, traffic management, autonomous vehicles, and integrated multi-modal transportation & urban planning.

Another interesting area for rich data science is the revolution underway in higher education. As more and more courses become available online for free, might we see a decline in traditional degrees? If so, hiring and retaining solid talent will require new models for selection and incentives. 
I have one thought which I would like others to consider.
To what extent can we use crowdsourcing like this?
Currently, Kaggle's crowdsourcing mainly covers basic scientific facts and algorithms. But I think this monopoly will end soon. Next on the stage will be data that isn't so basic. 'What is the best way to customize people's channels?' is a question we're familiar with. But extending these branches will mean crowdsourcing people's private data. 'What kind of house will this man buy?' 'What is the likelihood of this person committing a crime again?' These kinds of solutions are good, surely. Personally, I think these are the kinds of questions that systems like crowdsourcing are suited for. But I've seen people with different opinions: people who don't like their private data being so open, who'll give up so-called 'convenience' to keep themselves private. I don't have a solution to how these kinds of questions should be addressed. After polls, after signing agreements (without which you won't be able to use the service fully), after a global guideline is set, .... I'd like to hear what solutions YOU could come up with for these kinds of projects.
As I said above, I do not think crowdsourcing will be limited to basic science. And it doesn't have to be science at all. For example, literature could be crowdsourced. +The 39 Clues, a book series I like, has the same characters but a different author for each book. This is one kind of crowdsourcing, am I right? But if these kinds of projects get more diverse, I believe new kinds of problems will occur. Who would be able to patent a crowdsourced solution? The debate between +Samsung USA and +Apple inc. is already hot. 'If Apple patented the circle, then tires would be rectangular' (BBC). For example, if Apple's rectangular design had come from crowdsourcing, it would be harder to classify Apple's design as 'innovation' or 'common sense'. For books, who will earn credit for participating? Many such people don't want money, as this example for BOLD says out loud, but some people's minds change when they see money. I would like to address this question also, and to hear what other problems might occur and what solutions might be proposed.
I would like to benefit from the possibility of industry analysis with a view to making some informed decisions.
Would Kaggle help me analyse the Import Export Agent industry to see the potential investment opportunities there, please?  I am based in the UK and would like to go into this kind of business. I would, if possible, like to be connected to successful people in this industry who can guide me to success.
If there is a good example of what Peter tries to convey with his ideas in Abundance, this is it.
I have often wondered why the subject of History is not considered a big data problem begging for a solution. I know that History is diverse, varied and language-based (and therefore not really data in the context under discussion here), and some events will remain mysteries, but wouldn't it be cool to 'see' history modeled as a matrix? To be able to click on a node in one history matrix and follow the intricacies of a specific era in detail. I can imagine that History professors everywhere would wax apoplectic at the absurdity of such an idea.
Hey, Tad Auker, I like that.  What an intriguing idea; mapping history as a gigantic mesh of decision trees.  That way you could select a node and then see what might have happened if a different decision had been applied at any given node.  Nice.  A cause and effect machine.
Anyway, that's not the idea I was going to contribute.  What I was wondering about Kaggle is whether you could use it to reverse engineer "preferred solutions".  Sort of like risk management in reverse.  For instance, my wife works in Aged Care and she is always talking about how we don't learn well and that quality of care hasn't really changed much.  If I had the chance I would take all the injury and care-related information that is available for the Aged Care sector and "plug it into Kaggle" to do things like test standards and recommendations (for things like what floor coverings should be used in rooms frequented by care recipients with compromised dexterity) to see if over time the standards themselves are changing to reflect the emerging best practices as evidenced by the data.  A kind of feedback loop to test whether recommended standards are in fact appropriate or whether our application of each generation of standard is achieving any measurable improvement.  You could apply the same sort of feedback loop to things like the standardizing algorithms applied to educational and academic gradings; is the bell curve appropriate or would Kaggle illustrate a better standardizing algorithm?
Modelling how good is your model.
ERP systems from SAP, Oracle and Microsoft (AX) all have a module for making any kind of projection. The curious thing is that, although these companies have experts, they generally have not implemented functional models for predicting production, sales, finance, etc. The information cube can be excellently designed, and the modeling platform can be excellent too, but the employee who builds the specific model for the client is usually a systems programmer without the industry vision. That is why the model is almost always suboptimal, a problem compounded by the poor training end users receive.
I have a group of 20 friends who have a company; 70% hold doctorates in mathematics and 30% have master's degrees in systems programming. They build projection models for many companies around the world, and their main problem is defining the challenge to be solved. Many times their customers do not know what they require, have no idea what to forecast, or tell my friends something else entirely; my friends then build models that are not suited to what the clients require (a problem created by the clients themselves). My friends' projections always work better than the ERPs', but in implementation the client sometimes cannot run all the functions of the system because the technology my friends offer is more advanced than the client can understand.

I think Kaggle's challenge is to be a link between technology supplier and customer, evaluating whether the system works properly and whether the people who will use the system (projections) are able to use it.

The world needs companies that evaluate prediction systems and help their clients use them.
Peter, very interesting article, thank you. 

Is there any danger of competing enterprises being able to procure important information about rival companies that post competitions on Kaggle? 
I love it! I'm an intuitive who channels pieces of the great puzzles in health, science, and consciousness...this mining of data could help bridge the gap between science and spirit...and give some of the documentation for my work so that funders for my clients have more concrete data to make decisions from...and the speed of innovation can quicken.
Okay, darn it. The potential you reported here has me so excited I got up an hour early for my 4am meditation and of course, included this in it...and here I am writing at 3:59 when I should be back sleeping. Back in the early 80's I dated a guy who was moving to Colorado to study "Artificial Intelligence". I was Not impressed. "What are you going to do with That? And what limited means of artificial intelligence could produce much good?" (Okay, I had visions of the "HAL" scene from 2001: A Space Odyssey - and couldn't imagine much good that intelligence without Divine Spirit could do.) To this day my lack of vision and imagination haunts me. And I'm glad Someone wasn't listening to me! 

There is an area of study that I've been directed to focus on...but I need assistance to mine through all the data. I'm already electro-magnetically sensitive, but had a tick bite a few months ago that has resulted in Lyme disease and now the sensitivity has increased to the point that I'm limited to 2 hours/day on a wired computer. The area of study is probably contributing to 80% of all chronic illness today as well as loss of Vitality not only in humans, but in all life on Earth (yes, a BOLD statement). I love the collaborative spirit being demonstrated with Kaggle and the astounding results. Interested in a focus study for your book? An example pulled out of your crowdsourcing?
Seems very similar to, except that Kaggle seems to cater more to teams rather than individuals.
I'm a futurist; technology is our way to solve every problem we have. We've actually developed an artificial intelligence before we know it. It's amazing how Kaggle is expanding possibilities and unleashing talents. The best is still to come. Thank you, Peter.
I will check it out! Anything related to scientific data pools, technology's ethics etc  interests me a lot. Brainstorming continues..
Very interesting. Kaggle would be great for almost any organization - public or private - dealing with lots of data, to help understand what is hidden there or help resolve problems related to the area of the data.
What I would like to know is whether one could run a competition to collect ideas on how to use computational thinking to solve everyday problems. It would also be interesting to find the most promising methods for teaching technology to kids, and not necessarily in schools.  For example, what is the best method to teach kids how to create a (simple) robot?
I checked Kaggle's webpage. Both competitions currently running on Kaggle relate to operational efficiency in industry. I think that improving the efficiency of operations in organizations provides huge business opportunities.
Here is an example: 
Organizations can provide strategic business processes to be analysed for incompleteness.  Algorithms can then be developed to help optimize those business processes for specific goals. The same algorithms can be extended to suggest business process adjustments in reaction to shifts in market conditions or to new technological trends.   A kind of automated and continuous Monte Carlo.
+Guy Fraker Interesting point about the degree evaluations of higher education. As Massive Open Online Courses (MOOCs) become available to people outside of the University setting, I feel that potential employers must consider the "unofficial" classes when hiring new employees. Change is inevitable in the way employers view the education of their potential employees.
This whole discussion has been quite instructional and very thought provoking. We went ahead and launched a "beta". To say that the science behind even this MVP is complex, would be an understatement. However, we are seriously considering allowing a form of "open sourcing" this moving forward, which aligns with many comments in this thread.

So I'll put this on the table. We have "buckets" of unstructured data, combined with structured data, resulting in, not a reputational analysis, but a behaviorally predictive score re the ability of others to trust a person. Now we already encourage people to set their own weightings for the initial 5-6 "buckets" of data we pull (but don't retain). What if we added 3-4 blank boxes such that anyone could add in their own "buckets"? Think non-U.S. social media sites or College Teamwork systems. The net impact only improves the accuracy of their score, so they earn additional rewards for that voluntary contribution to greater transparency. In this scenario, the hard core science begins the evolution forward. Would this not positively impact the overall engagement & credibility of the value prop? You all have given us quite a bit to consider! Thank you!
Wow! Again, the possibilities are endless.

We could expand into the soft sciences. This can be used to benefit psychological research, a discipline that is constantly struggling with how to quantify and analyse phenomena.

Starting from psychology, marketing is only a small step away. 

The comment above about viewing history as a huge data set is genius.

Every complex system—the environment, space, (bio) chemical systems, maps & transport, epidemics, the human psyche, biodiversity, the blogosphere, you name it—could be analysed this way. 
I think we’re on to something here.

A few things I’m asking myself:
Could this help small businesses?
Or even individuals?
Could this be done on a smaller scale?
How can we somehow get involved?
I know this is pie-in-the-sky, but I've always wanted to figure out how to create a double decker double helix DNA strand.

First, there is a site called Foldit, where users create protein strands, which has been used to crowdsource the creation of new proteins.

Second, (I hope you can stomach this) there appeared a crop formation called the Chilbolton formation which showed a DNA strand that appeared to have four strands (i.e. two double decker strands).  In my opinion, this formation is legitimate, but I understand if you think I'm crazy, which is really beside the point.

Third, the advantage to this type of alteration to DNA is obvious and subtle.  Obvious because it would hold A LOT more information, and would also be suitable for maintaining the integrity of the information.  Subtle because this design would be very resistant to radiation and other types of genetic damage.

I could write a book on this subject, but the long and short of it is that if any type of organization or company could make a double decker double helix DNA strand, it would be the biggest development in the history of mankind.

I have lots more information and insight if you want to contact me: BTW, I also know that 99.99% of the people reading this post will think I'm crazy and what I'm writing here has no merit, so I didn't put a lot of effort in the posting - whereas the concept is well developed, but hasn't been very well presented here.  Just another mustard seed planted...
Kaggle reminds me of GitHub.  GitHub allows the user to share open source software code.  Users can upload or download code depending on their needs.  I think politics could benefit from a source like Kaggle.  Citizens could draft or edit legislation.  The government is notoriously behind the rest of the public when it comes to drafting legislation affected by new technology.  A Kaggle-like platform would be great for genomic work.  The genome will have a lot of data that will take hordes of skilled people to go through - more people than a government or company would want to subsidize.  
Kaggle offers a platform where data scientists can gather and solve problems.  Wouldn't it be great if there were a platform where people could upload data?  I think people who take medicine should have an online platform where they can report on the benefits and side effects of their medicine.  Clinical trials are notoriously skewed. I think more honest data could be gathered if people reported their own side effects online and shared their information with each other anonymously.  The crowd might unearth a problem long before a pharmaceutical company does.
This is really fantastic! I haven't understood how you know which of the proposed solutions are the best/good ones.