RDF and Linked Data Validation - ESWC'16 Tutorial

Slides

RDF Validation (slideshare)
Overview of RDF Data Model (slideshare)
ShEx by Example (slideshare)
SHACL by example (slideshare)
ShEx vs SHACL (slideshare)
Future work and applications (slideshare)

Examples and other material will be available at this repository

Abstract

RDF promises a distributed database of repurposable, machine-readable data. Although the benefits of RDF for data representation and integration are indisputable, it has not been embraced by everyday programmers and software architects who care about safely creating and accessing well-structured data. Semantic web projects still lack some common tools and methodologies that are available in more conventional settings to describe and validate data. In particular, relational databases and XML have popular technologies for defining data schemas and validating data. These currently have no analog in RDF.

Shape Expressions (ShEx) has been designed as an intuitive and human-friendly high-level language for RDF validation.

In 2014, the W3C chartered a working group called RDF Data Shapes to produce a language for defining structural constraints on RDF graphs. The proposed technology is called SHACL (Shapes Constraint Language), and a first public working draft was published in October 2015.

In this tutorial we will present both ShEx and SHACL using examples and RDF data modelling exercises.

Like the popular SPARQL by example tutorial, this tutorial includes step-by-step instructions with examples followed by exercises. Participants can download validation tools to use locally or use web-based interfaces like RDFShape or W3C ShEx Workbench.

Overview

RDF is growing in popularity for both data transfer and data storage/recall. In both of these capacities, it is important to describe and verify conformance with a particular graph structure. While the Semantic Web is an environment where anybody can say anything about any topic, we still need to make sure that clinical, genetic, manufacturing, etc. databases capture data in a predictable way.

When we record or exchange data, programs or human operators are expected to synthesize and interpret it. Processing data safely additionally requires that the data maintain a specified structure and can be described by that structure.

Non-RDF data storage systems offer and rely on schemas both to increase data integrity and to enable efficient storage and static query analysis for optimization. SQL's Data Definition Language completely constrains what may appear in an SQL database (with minor exceptions, such as databases that don't ensure homogeneity in a column). XML's use of W3C XML Schema and RELAX NG typically involves validation on data creation and ingestion. Even JSON Schema is growing in popularity as that developer community recognizes the need for basic structural description.

RDF, and graph stores in general, don't demand an initial schema definition like SQL, but operate more like XML where the basic language allows many structural constructs but specific applications impose further practical demands. In that sense ShEx and SHACL work with the open spirit of RDF (natively schema-less), while giving developers and data architects a tool to impose and validate some specific constraints.
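To make the idea concrete, the sketch below treats an RDF graph as a plain set of triples and checks individual nodes against a hand-written shape. It is a minimal illustration in Python, not ShEx or SHACL: the `Constraint` class, the `violations` function and the `ex:` and `schema:` names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# An RDF graph modelled as a set of (subject, predicate, object) triples.
# Plain strings stand in for IRIs and literals; every name here is hypothetical.
Triple = tuple[str, str, str]


@dataclass
class Constraint:
    """Illustrative triple constraint: a predicate, a cardinality, a value test."""
    predicate: str
    min_count: int = 1
    max_count: Optional[int] = 1          # None means unbounded
    value_ok: Optional[Callable[[str], bool]] = None  # None means any value


def violations(graph: set[Triple], node: str, shape: list[Constraint]) -> list[str]:
    """Return violation messages for `node`; an empty list means it conforms."""
    errors = []
    for c in shape:
        values = [o for (s, p, o) in graph if s == node and p == c.predicate]
        if len(values) < c.min_count:
            errors.append(f"{node}: too few values for {c.predicate}")
        if c.max_count is not None and len(values) > c.max_count:
            errors.append(f"{node}: too many values for {c.predicate}")
        if c.value_ok is not None:
            errors.extend(
                f"{node}: bad value {v!r} for {c.predicate}"
                for v in values
                if not c.value_ok(v)
            )
    return errors


graph = {
    ("ex:alice", "schema:name", "Alice"),
    ("ex:alice", "schema:email", "alice@example.org"),
    ("ex:bob", "schema:name", "Bob"),
    ("ex:bob", "schema:name", "Robert"),
    ("ex:bob", "ex:shoeSize", "43"),  # not mentioned by the shape, so ignored
}

user_shape = [
    Constraint("schema:name", min_count=1, max_count=1),
    Constraint("schema:email", min_count=0, max_count=None,
               value_ok=lambda v: "@" in v),
]

print(violations(graph, "ex:alice", user_shape))
print(violations(graph, "ex:bob", user_shape))
```

The graph itself stays schema-less: `ex:bob` carries an extra `ex:shoeSize` triple that the shape simply does not look at, while the second `schema:name` value is reported because the shape allows exactly one.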

The practicalities of data exchange faced by the Open Services for Lifecycle Collaboration (OSLC) community led to the development of Resource Shapes, a language for communicating the data structures managed by Linked Data Platform endpoints. Likewise, the Dublin Core community defined Description Set Profiles for describing constraints and expectations about bibliographic records. Neither of these underwent a standardization and implementation phase leading to widely deployed, general-purpose validation tools.

The current work developed by the Shape Expressions community and the W3C RDF Data Shapes Working Group may help to improve RDF adoption in industrial scenarios where there is a real need to ensure the structure of RDF data, both in production and consumption.

More information about ShEx is available at the ShEx Primer and about SHACL at the First Public Working Draft.

Topics

RDF Data Model
Shape Expressions
- Shapes and basic triple constraints
- Groupings and cardinality
- Value references, recursion and negation
- Semantic actions
SHACL
- Shapes
- Scopes
- Constraints: property constraints, property pair constraints, constraint operators: Or, And and Not
- Other features: SPARQL based constraints and templates
Applications and use cases: Linked data portals, Clinical data, etc.
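Value references and recursion are the topics that most often surprise newcomers, so a rough sketch may help. The Python below checks a node against a named shape whose constraints can refer back to shapes, including the shape being checked. It is an illustrative simplification, not the ShEx or SHACL semantics: the schema layout, the `conforms` function and all `ex:` and `schema:` names are assumptions, and negation is deliberately left out because combining it with recursion needs more care than this sketch takes.

```python
from typing import Optional

Triple = tuple[str, str, str]
# (predicate, min, max or None for unbounded, referenced shape name or None)
TripleConstraint = tuple[str, int, Optional[int], Optional[str]]
Schema = dict[str, list[TripleConstraint]]


def conforms(graph: set[Triple], schema: Schema, node: str, shape_name: str,
             in_progress: frozenset = frozenset()) -> bool:
    """Check `node` against `schema[shape_name]`, following shape references.

    A (node, shape) pair that is already being checked further up the call
    stack is assumed to conform, which lets cyclic data terminate.
    """
    if (node, shape_name) in in_progress:
        return True
    in_progress = in_progress | {(node, shape_name)}
    for predicate, min_count, max_count, ref in schema[shape_name]:
        values = [o for (s, p, o) in graph if s == node and p == predicate]
        if len(values) < min_count:
            return False
        if max_count is not None and len(values) > max_count:
            return False
        # A value reference: every value must itself conform to shape `ref`.
        if ref is not None and not all(
            conforms(graph, schema, v, ref, in_progress) for v in values
        ):
            return False
    return True


schema: Schema = {
    "User": [
        ("schema:name", 1, 1, None),
        ("schema:knows", 0, None, "User"),  # recursive reference to User
    ],
}

graph = {
    ("ex:alice", "schema:name", "Alice"),
    ("ex:alice", "schema:knows", "ex:bob"),
    ("ex:bob", "schema:name", "Bob"),
    ("ex:bob", "schema:knows", "ex:alice"),   # cycle: alice <-> bob
    ("ex:carol", "schema:name", "Carol"),
    ("ex:carol", "schema:knows", "ex:dave"),  # ex:dave has no schema:name
}

print(conforms(graph, schema, "ex:alice", "User"))
print(conforms(graph, schema, "ex:carol", "User"))
```

Here `ex:alice` and `ex:bob` know each other, so checking one requires checking the other; the `in_progress` set breaks the loop. `ex:carol` fails only because the node she references, `ex:dave`, lacks a name.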

Goals

Users will understand use cases for RDF validation.
Users will be able to create their own RDF data shapes or schemas and validate instance data against them.
They will see how RDF validation works in ShEx and SHACL.
Users will understand some advanced validation scenarios like cyclic data models, negations or recursion as well as more complex definitions.
Hands-on experience will leave them comfortable using existing tools to solve practical needs in communicating schemas and verifying instance data conformance.
Time permitting, we will dive into the details of the underlying technologies and algorithms for RDF validation.

Audience

The audience should be comfortable either with using git and a JVM or a JavaScript runtime like Node.js, or with just their web browser. A rudimentary knowledge of RDF and Turtle is expected. Like SPARQL by Example, this tutorial is intended to introduce the audience to a language that is new to them.

Tutoring team

Jose Emilio Labra Gayo. Associate Professor, University of Oviedo, Spain. He is the main researcher of the WESO research group. He is a member of the RDF Data Shapes Working Group and the chairman of the Best Practices for Multilingual Linked Open Data Community Group. He implemented a Shape Expressions library in Scala called ShExcala and maintains the online RDF validator service RDFShape.
Eric Prud'hommeaux. W3C staff contact for the Health Care and Life Sciences Interest Group, RDF Data Shapes (RDF Validation), LDP, RDF 1.1, SPARQL 1.1, RDB2RDF, SPARQL 1.0, SAWSDL and XML Protocol Working Groups. He has developed and designed multiple languages, including a significant contribution to SPARQL and ShExC (Shape Expressions Compact syntax). He developed the Fancy ShEx Demo to promote understanding about and exploitation of ShExC.
Harold Solbrig. Mayo Clinic, USA. Harold has been involved in computational semantics and information modeling since the early 1970s. He represents the Mayo Clinic on multiple standards organizations including Health Level Seven (HL7), the World Health Organization (WHO), the International Organization for Standardization (ISO), the Object Management Group (OMG) and the World Wide Web Consortium (W3C). His focus has been standardized models and APIs for terminological resources and the tools to represent them in clinical and biomedical data. He is currently working on the representation of UML and constraint-based modeling paradigms as RDF constraints and on the linking of RDF data with ontological resources. Mr. Solbrig has helped develop the ShEx language and is an active participant in the RDF Data Shapes Working Group.
Iovka Boneva. Associate Professor at University of Lille, France, and member of the Links research project affiliated to Inria -- Lille Nord Europe and CRIStAL (Centre de Recherche en Informatique, Signal et Automatique). She has been working on expressive languages for describing and querying tree-structured data, and more recently on data exchange for relational and graph data. She is a member of the W3C RDF Data Shapes Working Group, and has developed the theoretical foundations of the Shape Expressions Language.

Registration

To register, visit: http://2016.eswc-conferences.org/

Schedule

The tutorial will be given on 30th May, 9:00-12:30 (see Conference Program).