Profile

Stian Soiland-Reyes
Works at myGrid, School of Computer Science, University of Manchester, UK
Attended NTNU
Lives in Manchester, UK
126 followers | 7,146 views

Stream


Stian Soiland-Reyes

Tools & Services
ORCID as #FOAF profile w/ #PROV and #PAV now live as an experimental feature.

(I'm working on fixing the http://orcid.org content-negotiation redirects)
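
For anyone who wants to poke at such a profile, a minimal sketch of a query (assuming the profile's RDF has been loaded into a SPARQL store; the exact FOAF/PAV properties the ORCID profiles expose are my assumption here):

  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX pav:  <http://purl.org/pav/>
  # Who is described, and where was the record imported from?
  SELECT ?name ?source
  WHERE {
    ?person a foaf:Person ;
            foaf:name ?name .
    # PAV provenance, if the profile carries it
    OPTIONAL { ?person pav:importedFrom ?source }
  }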
 
I fully agree with +David Karger's critique of the Semantic Web:
We need more UIs and practical applications (more web) and less focus on rules and models (less semantics).
#eswc2013
 
 
Posted http://blog.schema.org/2013/06/schemaorg-and-json-ld.html

"Schema.org and JSON-LD
We'd like to take a minute to share our enthusiasm for some recent work at W3C: JSON-LD. 

Schema.org is all about shared vocabulary - it helps integrate data across applications, Web sites and data formats. We are adding JSON-LD to the list of formats we recommend for use with schema.org, alongside Microdata and RDFa - each has strengths and weaknesses for different usage scenarios.

In HTML, schema.org descriptions can be written using markup attributes in HTML (i.e. RDFa and Microdata). However there are often cases when data is exchanged in pure JSON or as JSON within HTML. W3C's work on JSON-LD provides mechanisms for interpreting structured data in JSON that promotes interoperability with other data formats. We believe it provides value for developers and publishers, and improves the flow of information between JSON and other environments. 

There are some technical details to work through on how exactly schema.org terms are defined for JSON-LD usage, but it is already clear that JSON-LD is a useful contribution to structured data sharing in the Web. Many thanks to the hardworking W3C community for creating the specification."
 
10 Best practices for workflow design: How to prevent workflow decay. Also see Kristina's paper at http://ceur-ws.org/Vol-952/paper_23.pdf


 
Recommended read: +David Karger's harsh but true critique of the #SemanticWeb - we need more web, not more semantics.
#eswc2013  

Stian Soiland-Reyes

Shared publicly
 
 
In my role as a UniProt developer I was asked why we use SPARQL+RDF. I thought the answer could be interesting for others as well.

                              *
I will be giving another talk about this at the BioHackathon 2013
(http://2013.biohackathon.org/documents/symposium). I hope this will also be made available on YouTube by the kind DBCLS.


                               ***
               The following is my personal opinion only!
               I wear some rose-tinted glasses in relation
               to SPARQL. But that is just my blood from
               banging my head against the relational/
               flat-file walls.
                               ***


I understand the NCBI policy makers. They have heard many of the semantic web's promised benefits before: use ASN.1, it's such a great standard. Oh sorry, nearly nobody uses ASN.1; use this XML thing instead, it will be so easy to query your data with XPath. In the meantime most users use the flat-file GenBank or MEDLINE files... And you can't really deprecate a format once published (at least not without an outcry).

When I started at UniProt just over 5 years ago I thought the same. Oh great, file format number 8 [1], do we really need another one? (I can already hear the sigh coming from some of the experienced NCBI developers.)
Today I say yes: the RDF one is the future of the UniProt formats (far future, but future nonetheless).

Yet you must understand that using SPARQL rather than SQL is not an interesting change in terms of biological science. There is theoretically nothing possible with SPARQL that is not possible with SQL etc., or even clay tablets. The only thing that changes is the number of slaves, oops I mean PhD students, needed to get a result. I claim SPARQL+RDF is more economically efficient in the aggregate than SQL, which is why I support this move.

These are the same reasons that programmers mostly moved from C/Fortran to Java or Perl, and then in part to Ruby and Python. It is really hard to argue that moving from C to Perl was necessary for science reasons.
The clear truth is that not needing to worry about memory allocation or basic data structures allowed many more programs to be created. Sure, you lose some efficiency at the CPU level, but you gain massive efficiency at the programmer level. This is great because programmers get more expensive every year, while CPU ticks keep decreasing in price.

Back to the NCBI, where large databases keep growing in size and, even worse, in complexity. It is financially impossible for a small lab to fully integrate the knowledge contained in RefSeq or UniProt into their own data infrastructure, especially if we include the need to keep that data up to date. Just understand that these databases are nearly terabytes in size when uncompressed and stored in a relational database, with 100+ interlinked tables. And these are just 2 of the large-ish public databases. Even if you think this work is trivial, why would the NIH pay hundreds of small labs to do this work over and over again? And not just the NIH but all the other funding agencies, when they could instead fund 2 SPARQL endpoints that all of their users could use? Is this not a form of useful cloud computing?

But of course you could say: just make your SQL database available, like UCSC does for their genome browser. Many bioinformaticians would cheer this on. Yet there is one thing SPARQL has that SQL does not: SPARQL is practically standardized, while SQL is only theoretically standardized. See the differences between DB2, Virtuoso, Vertica, Oracle, MySQL and PostgreSQL in practical terms. Is it "SHOW TABLES" or "SELECT table_name FROM user_tables"? Oh, it was "LIST TABLES" ;(.
Many SQL vendors don't even commit to supporting the ISO SQL standards.

Compare this to the SPARQL world. IBM and Oracle both fully support SPARQL 1 (Oracle even using 2 databases: Spatial and NoSQL!), as do YarcData (Cray), BigData (Systap), Virtuoso, the Apache Software Foundation, Sesame (2), Ontotext, Clark & Parsia, Markdata and many more I can't think of. And for each of them the "show tables" equivalent is "SELECT DISTINCT ?type WHERE { ?s a ?type }". In the 5 years since standardization we actually have a lot of products that support the whole of the SPARQL standard, something the SQL world has not managed in 21 years! I expect that at least 8 of the above list will be fully SPARQL 1.1 compliant by the end of summer.
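
Written out in full, that query is the same one-liner on every one of those stores (a sketch; the aggregate variant needs SPARQL 1.1):

  # Works on any SPARQL 1.0 store - the "show tables" of RDF:
  SELECT DISTINCT ?type
  WHERE { ?s a ?type }

  # SPARQL 1.1 aggregate form, adding instance counts per class:
  SELECT ?type (COUNT(?s) AS ?n)
  WHERE { ?s a ?type }
  GROUP BY ?type
  ORDER BY DESC(?n)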

This means that a choice for a SPARQL database by the NCBI does not favor any single database company. Also, one team may have requirements of their datastore that others do not. Yet all of the datastores present the exact same API to your users: SPARQL. So if RefSeq needs solution A, the PubMed team can use solution B without negatively impacting your querying users.

Lastly, as my included presentation shows, the final killer feature is the SERVICE keyword. Need to do analytical queries over two databases (see the presentation at [2] for some examples)? No need to download all the data; just use their SPARQL endpoints and federated queries. In this case we used 2 different SPARQL solutions: UniProt using OWLIM, and I think ALLIE and ChEMBL using Virtuoso. The same works for querying between UniProt and Nature citation data, even though their endpoint uses software from The Stationery Office (5Store; hah, I could think of one more).
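
To make the SERVICE idea concrete, here is a minimal sketch of a federated query (the second endpoint URL is a placeholder, not a real service; up: is the UniProt core vocabulary):

  PREFIX up: <http://purl.uniprot.org/core/>
  SELECT ?protein ?mnemonic ?fact
  WHERE {
    # Answered by the endpoint you send the query to:
    ?protein a up:Protein ;
             up:mnemonic ?mnemonic .
    # Answered live by a second endpoint - no bulk download,
    # no local integration step:
    SERVICE <http://example.org/sparql> {
      ?protein ?p ?fact .
    }
  }
  LIMIT 10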

Then what about the popularity of XML versus RDF? RDF for UniProt is close to matching the popularity of XML and might have exceeded it (I will have to look at the latest logs). The SPARQL endpoint only gets about 3,500 queries a day, but it has not been advertised or even linked from the main uniprot.org website. This will stay that way for as long as the SPARQL endpoint is in beta (and as the hardware started throwing RAM errors last week, it might be a while ;( ). Yet those queries are answered, and most of them could not be answered with our full-text indexes on uniprot.org or even our production SQL databases. Most importantly, the SPARQL endpoint saved my bacon when a SAB member needed some very specific data pronto.

Of course you have one important question: what does it cost to provide a SPARQL endpoint? This is a good and valid question. The answer, of course, depends...

On a greenfield project I think that, given comparable experience among your staff, RDF+SPARQL is cheaper and more performant than a SQL approach.

Why is SPARQL cheaper than SQL when starting from scratch?
1. The graph nature of a SPARQL endpoint allows you to use it as a key-value store for your data at the same time as using it for your complex searches (see the sketch after this list).
2. JSON-LD and SPARQL/JSON give you a cacheable API for your Web 2.0/Ajax website to use without custom programmer development.
3. You do not need to design a separate data interchange format; you can just use RDF.
4. Competitive tendering: moving from one SPARQL endpoint software to another is days of work, i.e. you get the same answer and the only difference is the speed at which you get it. Even using JPA or Hibernate, evaluating many SQL stores is not that easy!
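
As a sketch of points 1 and 2, two separate queries against the very same store (the accession is illustrative):

  # Key-value style lookup: everything known about one resource.
  DESCRIBE <http://purl.uniprot.org/uniprot/P12345>

  # A complex search against the same store; request the results
  # with Accept: application/sparql-results+json and the answer
  # is cacheable JSON, ready for an Ajax front end.
  PREFIX up: <http://purl.uniprot.org/core/>
  SELECT ?protein
  WHERE {
    ?protein a up:Protein ;
             up:reviewed true .
  }
  LIMIT 100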

Of course greenfield programming is rare and won't be the case for most projects at the NCBI. Yet even for old projects, providing SPARQL/RDF can be worth it. Firstly, it's not that expensive to provide RDF besides your existing XML: one intern can make a great XSLT in a few months. You can make your SQL database available via a SPARQL mapper. Even writing a SPARQL wrapper against CSV files is easy (days of work for a good programmer).

There are risks and costs involved in starting down the semantic web road. The first risk is to introduce more semantics than your data has, i.e. instead of converting from one serialization (e.g. ASN.1) to RDF, you try to redesign your whole data model. The second risk is assuming you can throw out your old infrastructure once the SPARQL-based one is live. Assuming you can easily replace years of IT infrastructure in e.g. GenBank with a year of work on a SPARQL endpoint is false. I think it is relatively cheap to complement the existing infrastructure with simple, direct RDF and SPARQL. The reality is that a format once published needs to be supported for a long time.

Will the choice for SPARQL affect all your users in their day-to-day work? No, it's just a nicer pipette for the data analysts. They are still going to complain about your data modeling, about the bizarre exceptions from 1981 that were never fixed, that their queries are too slow and your documentation is useless. We are dealing with humans here: we can make things easier, but they will still be struggling with the really hard parts of data quality.

To conclude:
1. SPARQL is just cheaper for users than SQL or traditional solutions (where those solutions don't already exist).
2. The ideal SPARQL world is closer to data heaven than the ideal SQL world.

Hope you can all use something of this long rant.

1. FASTA, FASTA (canonical), flat-file, GFF3, XML, CSV, Excel, list
2. http://www.slideshare.net/jervenbolleman/uni-protsparqlcloud
Work
Occupation
Technical Software Architect at myGrid, University of Manchester
Employment
  • myGrid, School of Computer Science, University of Manchester, UK
    Software architect, 2006 - present
  • NTNU, Trondheim, Norway
    System developer, 2002 - 2005
  • Linpro, Trondheim, Norway
    Software developer, 2000 - 2002
  • Sandvik AS, Stavanger, Norway
    IT assistant, 1997 - 2000
Places
Currently
Manchester, UK
Previously
Birmingham, UK - Trondheim, Norway - Stavanger, Norway
Story
Tagline
Workflows, digital preservation, semantic web, provenance, OWL, Java, Ruby, Python, Clojure
Introduction
Software developer at myGrid, working on the digital preservation project Wf4Ever and the workflow system Taverna. Also involved with the W3C Provenance Working Group, developing standards like PROV-O.

This profile is not to be confused with my geeky G+ profile or my personal G+ profile.
Education
  • NTNU
    MSc Computer Science, 2003 - 2006
  • NTNU
    BSc Computer Science, 1998 - 2003
  • University of Birmingham
    Visiting researcher, Computer Science, 2005 - 2006
Basic Information
Gender
Male
Other names
Stian Søiland