New blog post: "SPARQL and Big Data (and NoSQL)" Please add any comments here.
Great observations indeed and agreed, yes! I also enjoyed the Seven Databases book a lot and hope that +Jim Wilson will consider taking SPARQL/RDF stores for the next version into account ;)
If you need another book on NoSQL I am working on one for Manning Publications.  We have the first five chapters in an early access program.


I like what you had to say about creating SPARQL endpoints for NoSQL databases.
Relational databases are usually layered over sparse row stores of the type you describe. There are blocks containing an integral number of rows plus possible slack, as the rows are not fixed in length.  Each row contains a column ID, possibly followed by a type indicator, followed by unpadded data; rinse and repeat.  NULLs are not represented at all.  The only difference from a sparse row store is that the column ID is specific to the table rather than being global to the whole store.
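A minimal sketch of the row layout described above, assuming a 2-byte column ID and a 2-byte length prefix in place of the optional type indicator (both widths, and the names `encode_row` and `decode_row`, are illustrative assumptions, not any particular engine's format):

```python
import struct

def encode_row(row):
    """Encode {column_id: bytes or None} as repeated (id, length, data) cells."""
    out = bytearray()
    for col_id, value in sorted(row.items()):
        if value is None:
            continue  # NULLs are simply not represented
        out += struct.pack("<HH", col_id, len(value))  # column ID, then length
        out += value  # unpadded data
    return bytes(out)

def decode_row(buf):
    """Walk the cells back into a dict; absent columns are implicitly NULL."""
    row, pos = {}, 0
    while pos < len(buf):
        col_id, length = struct.unpack_from("<HH", buf, pos)
        pos += 4
        row[col_id] = buf[pos:pos + length]
        pos += length
    return row

encoded = encode_row({1: b"Ann", 2: None, 3: b"x"})
decoded = decode_row(encoded)
```

Column 2 takes no space at all in `encoded`, which is the point of the sparse layout.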
What I haven't seen to date is a good article/etc. on how to use RDF/OWL/SPARQL to do Map/Reduce.  RDF's ability to auto-merge should help in combining the 'Map' results ready for the 'Reduce' phase, but there are many ways you might conceive of this working, and I've yet to see any real guidance on what has actually been found to work.

For example, you first have to decide how you would partition your data among a grid of triple stores (actual or virtual).  Then you have to work out how you would query those individual stores, whether you would send the same 'map' query to all stores, or whether you would send different queries to different stores with different data.

After that query, you have to decide what to do next.  You could enrich each set of 'map' results using ontologies or additional SPARQL queries (or both), then let the results auto-merge, optionally do another phase of enrichment on the merged results, then do your final 'reduce' query or queries.
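
One possible shape for these phases, sketched with plain Python sets standing in for triple stores. The data, the predicate names, and the helper functions are all hypothetical, and the merge is a simple set union, which assumes no blank nodes need renaming:

```python
from collections import Counter

# Hypothetical partitions: each "store" is a set of (subject, predicate, object).
stores = [
    {("alice", "knows", "bob"), ("alice", "livesIn", "Paris")},
    {("bob", "knows", "carol"), ("alice", "knows", "bob")},
    {("carol", "knows", "alice"), ("carol", "livesIn", "Oslo")},
]

def map_query(store, predicate):
    """'Map' phase: the same query is sent to every store."""
    return {t for t in store if t[1] == predicate}

def merge(result_sets):
    """Merge phase: set union, so duplicate triples collapse automatically."""
    merged = set()
    for result in result_sets:
        merged |= result
    return merged

def reduce_query(triples):
    """'Reduce' phase: count outgoing links per subject over the merged graph."""
    return Counter(s for s, _, _ in triples)

mapped = [map_query(store, "knows") for store in stores]
merged = merge(mapped)
counts = reduce_query(merged)
```

An enrichment step (ontology-based inference or further queries) would slot in between `map_query` and `merge`, or between `merge` and `reduce_query`, which is exactly where the open design decisions lie.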

There are a number of decision points to deal with, and there are a number of phases that have to be controlled somehow.  Having something concrete that can be suggested as a 'best practice' for Map/Reduce with RDF/OWL would make it easier for people to compare with other NoSQL technologies that are being used with Hadoop/etc. for the same purpose.
While we're quoting +Edd Dumbill, how about 

"Processing RDF is therefore a matter of poking around in this graph. Once a program has read in some RDF, it has a ball of spaghetti on its hands. You may like to think of RDF in the same way as a hashtable data structure -- you can stick whatever you want in there, in whatever order you want. With XML, you expect to get a tree with its elements in a predictable order. If an expected element is missing, then it tends to render your whole XML document invalid with respect to the schema you are using. If an element you don't expect to be there is present, then again your document becomes invalid. With RDF, you have no particular expectations; if the particular strand of spaghetti you're looking for isn't there, the rest of the ball remains. If there are new strands you don't expect, you won't even notice their presence if you don't go looking for them."

I find the hashtable analogy very apt. 
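
To make the analogy concrete, here is a small sketch that stores triples in nested hashtables. The names `graph`, `add`, and `values` are made up for illustration and do not come from any RDF library:

```python
from collections import defaultdict

# subject -> predicate -> set of objects
graph = defaultdict(lambda: defaultdict(set))

def add(s, p, o):
    """Stick a triple in, in whatever order it arrives."""
    graph[s][p].add(o)

def values(s, p):
    """Look for one strand; a missing strand is just an empty result."""
    return set(graph.get(s, {}).get(p, set()))

add("book1", "title", "Seven Databases")
add("book1", "someUnexpectedProperty", "whatever")

titles = values("book1", "title")
authors = values("book1", "author")  # not present, nothing breaks
```

The unexpected property sits in the graph unnoticed, and the missing `author` yields an empty set rather than an invalid document.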
...good piece, just wanted to emphasize +Anthony Coates's remark: I think we in the semantic world need some kind of bridge to data analytics & machine learning techniques... I have seen these links more on a research level (but not so much on a working product/solution level).
The more savvy NoSQL folks play with bits of schema in a comparable way. We interviewed Nathan Marz of BackType (now part of Twitter), and back then he was using Thrift to modularize social graph data via bits of schema:

Ideally, more of the intelligence would end up out on the web rather than in an app or some type of one-off analysis that doesn't get shared. What's unfortunate is that the most talented NoSQL developers create these novel ways to do large-scale analytics, but their methods don't become part of a standard canon or sharable on the web. The SemWeb folks seem much more apt to share and collaborate methodically.

Meanwhile, the SemWeb folks generally don't work with large datasets because of latency issues in clusters. I've had the impression that analyzing large aggregations of non-deterministic graph data quickly requires a non-partitioned, supercomputer-style shared-memory environment. See the claims at and let me know if I'm wrong about that.

The main challenge for the SemWeb community seems to be how to expand its developer footprint by being more inclusive, and yet still adhere to its principles. The NoSQL folks could use more principles and cohesion. Their environment is quite balkanized, and they're almost always just focused on the task at hand and whether it scales or not for ad serving or whatever application challenge they're faced with. When it stops scaling, they create yet another database type.

If you were to try to build a bridge to the NoSQL community, perhaps there are niches where developers are working with non-perishable data that's worth investing the time to do more with. In analytics environments, maintaining historical data systematically could help. Maybe the SemWeb community could act to curate this data and make it more sharable? Just trying to think of ways to build bridges between communities....
+Bob DuCharme "Deconstructing the Database" [1] is a starting point re., the quest for common ground from the DBMS angle.

The RDF and NoSQL realms share DBMS deconstruction (which includes minimal-ad-hoc-schema based information model) as a common goal.  

Yes, we need better marketing via common terminology. The RDF model (endowed with machine-readable, explicit entity relationship semantics) builds on the EAV model (based on implicit entity relationship semantics).  Until the connection between the two is accepted, by the RDF community in particular, we will continue to face the critical bridge-building needed to address this very old problem.


1. -- deconstructing the DBMS presentation by +Rich Hickey 

2. -- TimBL design issues note covering what RDF isn't about

3. -- URI for this post  via my personal linked data space  which you can import into yours  etc . 

#LinkedData   #RDF   #NoSQL   #NoSilo   #DBMS   #CoRelational   #RDBMS  
Great link!  Rich Hickey, the author of Clojure, really nailed it in this video how we need to move to an assertion-based information model, not just a data model.  We just need a standardized functional and declarative programming model that does not have LISP syntax.
+Anthony Coates Hi Tony--instead of building an RDF-based MapReduce algorithm, I think it would be lower hanging fruit for someone to do for one of the existing MapReduce engines what D2RQ did for relational databases: build a middleware layer that lets us send SPARQL queries to it. Like I said, turns up some interesting things--mostly academic, but a start. These people need encouragement.
+Bob DuCharme -- if there's ODBC or JDBC access to a Hadoop data server, the rest boils down to RDF or RDF-based Linked Data mapping using R2RML tools. This kind of mapping has existed for a while, beyond D2RQ, which is one of many options.
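As a rough illustration of what such a mapping does (this is not R2RML syntax, and the function name, the `http://example.org/` base URI, and the URI patterns are all assumptions made for the sketch), each row fetched over ODBC/JDBC becomes a subject URI plus one triple per non-NULL column:

```python
def row_to_triples(table, key_column, row, base="http://example.org/"):
    """Turn one relational row (a dict) into (subject, predicate, object) triples."""
    subject = f"{base}{table}/{row[key_column]}"
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the subject URI; NULLs yield no triple
        predicate = f"{base}{table}#{column}"
        triples.append((subject, predicate, value))
    return triples

triples = row_to_triples("employee", "id", {"id": 7, "name": "Ann", "dept": None})
```

Real R2RML tools express the same idea declaratively, so the mapping can be used to rewrite SPARQL into SQL rather than to materialize triples.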
+Bob DuCharme -- also note share-nothing based horizontal data partitioning and parallel (ideally vectorized) execution is what cluster edition DBMS engines bring to the table re. massive dataset handling. Basically, that's what we've done for a while re. Virtuoso's cluster edition. 

Hadoop isn't the only game in town, far from it :-)
SPARQL/RDF is fantastic because at some point you are going to ask a question that involves information stored in two or more systems. At that point SPARQL has unbeatable latency in the eyes of end users, because integration is a single query that often takes minutes to hours to write. The other option is to download, parse, integrate, and then query the joint data, something that often takes a programmer and team weeks to do correctly. So while the query might take a day to run via federated SPARQL and milliseconds in a central Hadoop cluster, the end user still gets an answer weeks earlier.

Data stored in two separate SPARQL endpoints is easy to integrate via the SPARQL 1.1 SERVICE keyword over HTTP. Combining data in two Hadoop clusters over HTTP, even when using Hive, HBase, or Pig, is not an easy task. And that assumes both data sources are Hadoop...
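
A minimal sketch of such a federated query, built as a string in Python. The endpoint URLs and the `http://example.org/` predicates are placeholders, not real services, and `federated_query` is a hypothetical helper:

```python
def federated_query(endpoint_a, endpoint_b):
    """Build a SPARQL 1.1 query that joins data from two remote endpoints."""
    return f"""
SELECT ?item ?label ?price WHERE {{
  SERVICE <{endpoint_a}> {{ ?item <http://example.org/label> ?label . }}
  SERVICE <{endpoint_b}> {{ ?item <http://example.org/price> ?price . }}
}}
""".strip()

query = federated_query("http://a.example/sparql", "http://b.example/sparql")
```

The join on `?item` is performed by whichever SPARQL processor receives the query, so no download-and-integrate step is needed on the client side.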

I am starting to see people in the wild who combine data from two or more SPARQL endpoints and their personal spreadsheets. I have not seen anyone do this with other NoSQL solutions.

+Alan Morrison Yarcdata sells large unpartitioned memory machines. For sub-second latencies on random queries over large datasets, that may be required. Many people are happy with latency measured in days as long as they get an answer, an answer that current systems are not providing.

The last major benefit of RDF/SPARQL is that it is storage-layout agnostic. SPARQL against an RDBMS: check. SPARQL against HBase: check. SPARQL against a specific Java API: check. SPARQL against web services: check. SPARQL against a key-value store: check. No other query language works against such a variety of technical solutions. And this means it works for both Big and Smart data ;)
Great post! The most telling point for me was about needing to "market" to the Big Data and NoSQL communities.

I don't see the technical prowess of SPARQL as addressing that issue.

That is, assume SPARQL can do all the things the commenters have described. How do those fit into the Big Data/NoSQL communities?

More precisely, how does SPARQL fit into the social, economic and political arrangements in the Big Data/NoSQL communities? Whose arrangements will be disrupted? Who will be advanced?

Not "technical" questions in any sense but I suspect important ones for the promotion of SPARQL or other semantic technologies.

+Patrick Durusau -- the challenge is ultimately about discovery, exchange, and general exploitation of insights culled from Big Data. Basically, the 4th "V" which stands for "Verity". 

Linked Data provides a more fine-grained mechanism for data access, representation, integration, and dissemination. 

SPARQL is a query language which is vital in two areas:

1. data access and manipulation -- just like SQL is for the more limited RDBMS realm

2. an optional implementation detail for associating a resolvable (de-referencable) entity name with an entity description (or descriptor) document comprised of entity-attribute-value or more semantically rich RDF graph based content. 

Back to NoSQL and friends. They are taking a different route to the same destination. Ultimately, hyperlinks are going to be used to enhance the following re. insights discovery and dissemination:

1. entity names
2. entity description (or descriptor) document content.

With the above in place, you are going to require a query language that goes beyond SQL, due to the heterogenous nature of this kind of data across entity relationship semantics and data source dimensions.

This is still ultimately about the Web as a query friendly DBMS connecting heterogenous data sources. Basically, the Web as a virtual DBMS of the kind most dreamed about many years ago. 
+Kingsley Idehen -- When you say:

"This is still ultimately about the Web as a query friendly DBMS connecting heterogenous data sources."

That is a goal that you, me and many others find interesting. But my point wasn't how to market to any of us.

Or how to market to the "big data" crowd when you get down to it.

There are non-"Web as a query friendly DBMS" and non-NoSQL types who are buying NoSQL and not the "Web as a query friendly DBMS."

Extolling the virtues of SPARQL fails to address the needs/wants/desires that underlie that difference in buying behavior.

Caveat: If I knew the answer, it would have been on the first line of this comment.
+Patrick Durusau --  You get Big Data folks to understand their moniker ultimately boils down to the utility of the Web as a globally accessible query friendly DBMS. A DBMS where URIs are Data Source Names. That's basically it. 

Of course, the paragraph above doesn't in any way mean the end of politically driven cognitive dissonance. The only way to tackle that is via palpable opportunity costs being incurred by those who seek to play the aforementioned game.

Papers don't fix cognitive dissonance. Neither do discussion threads. This ailment is killed via real-world solutions that demonstrate the utility of the concept being promoted, etc.
+David Lee Any XML-based system that you plan to scale up will require lots of careful schema design before you accumulate any data, and after you start accumulating, schema changes are not something to take lightly. And, aggregation of different XML from different people--even if it's data about the same things--requires lots of conversion coding before you can use the aggregate data, unless they all used the same carefully designed schema to store their data. So, it doesn't really fit the scenarios I've described.
Much of this is how we think as well. With our solution we try to find the sweet spot between Big Data, NoSQL, and SemWeb standards like the SPARQL query language. We also believe that not having a standardized query language is a big (and not widely discussed) problem in the NoSQL scene; just look at what happened to object-oriented databases. Enterprise customers especially are known to dislike vendor lock-in, which is what happens with a DB that uses a non-standard query language or API; when they choose a standard like SPARQL they avoid just that.
I see you mention neo4j, which is a great graph DB. I did a bit of benchmarking on neo4j as a triple store (comparing it with the OpenRDF Sesame native store and another NoSQL graph DB, OrientDB).
The TinkerPop Blueprints API provides an RDF API on top of neo4j and OrientDB.
This appeared to me to be a very promising alternative for the storage layer of triple stores. Unfortunately the first figures give the advantage to the OpenRDF Sesame native store, which is not among the strongest players in terms of performance; neo4j comes second (3 minutes behind), and OrientDB is last, with very poor performance (19 minutes behind) and bugs.
These measurements are times for loading TriG files. The same performance differences are visible when measuring query response time.
We still need to wait for improvement in this field!
+Elie Naulleau -- wondering if you saw the technical report we released about Virtuoso's column-store edition (different from the current row-store which is a screamer in its own right).


1. -- loading 50+ Billion Triples into the Virtuoso Column-Store edition

2. -- some Virtuoso Column-Store edition benchmark results. 
Well, after some tinkering time and a version upgrade (!searchin/neo4j/blueprints/neo4j/g8bV8w3LH9E/WIgx5GP14KAJ) it turns out that neo4j coupled with Blueprints + GraphSail performs quite well.
To me this is a good piece of news because, paradoxically enough, SPARQL 1.1 can tell you whether two concepts are connected, but it cannot print out the path of the connection. Neo4j being first and foremost a graph DB, it can do path search. I've heard, though, that Virtuoso has an extension to do this. Thanks for the links.
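The distinction can be sketched as follows. A SPARQL 1.1 property path such as the one in the comment below answers only yes or no, whereas a graph-style search returns the route itself. The edge data and the `find_path` helper are hypothetical, and the breadth-first search is a stand-in for whatever traversal a graph DB actually uses:

```python
from collections import deque

# SPARQL 1.1 can only test connectivity, e.g. (illustrative prefix and predicate):
#   ASK { :a (:linksTo)+ :d }

edges = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")}

def find_path(edges, start, goal):
    """Breadth-first search that returns the nodes along a shortest path, or None."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(adjacency.get(path[-1], [])):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

path = find_path(edges, "a", "d")
no_path = find_path(edges, "d", "a")
```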

Very good post, particularly as it (and the responses to it) prove out some things I've suspected for a while. The first is that much of the XML community has been quietly migrating to RDF/OWL/SPARQL for the last couple of years. I suspect this is primarily because most of us are now thinking about big data systems and dealing with large-scale heterogenous XML groves (which is essentially what an XML DB is, after all), and seeing relational mapping and inferential processing as the biggest bottleneck that emerges when you deal with large numbers of XML documents in a data store.

Hadoop has a critical role in ETL, but I see a systematic progression in the Big Data space. Hadoop and other M/R solutions are reasonably good for taking non-structured (unmarked-up or minimally marked-up) content, performing entity extraction, document enrichment and other NLP processing, and then indexing that content. What typically happens at that point is that each community's preferences for data storage get in the way - the RDBMS types are trying to get the data into relational tables, the XML and JSON document types are more concerned about extracting and constructing discrete data property bundles, while the RDF/OWL folk are more concerned about working with assertions and relational bindings.

The central challenge that RDF has always faced is in decomposing and binding entity relationships, which are usually harder to extract than properties are for a given set of information from raw data.  Where SPARQL 1.1 shines is that with a sufficiently large volume of information, a lot of the implicit relationships can be made explicit and consequently can themselves be objectified as entities. Here I think is one place where Hadoop and SPARQL can work effectively with one another, as the process of objectifying those relationships is asynchronous, distributed and likely near continuous.

I'm not that worried about the NoSQL crowd - in many cases, the people that are working with JSON stores (and similar hash stores) are now dealing with the same domain that XML developers have dealt with for the last decade, and will eventually come to the same conclusion that relationships are as significant as properties. At that point, SPARQL is there, it's JSON friendly, and it solves the more complex problems that arise from relational logic that many web developers generally do not encounter because of scope - they're working at the instance level, rather than at the aggregate.