
Content-Addressing Semantic Data

Published
Oct 24, 2019

This document describes alternative approaches to identification and reification in RDF based on content-addressing canonicalized datasets.

A basic understanding of RDF is assumed, especially the concepts of blank nodes and RDF datasets, but skimming the RDF Primer should be enough to get by. JSON-LD is also mentioned but the details aren’t important; for our purposes it’s “just another RDF serialization”.

For simplicity, the distinction between IRIs and URIs is ignored.

Background

Blank nodes in RDF are anonymous nodes that don’t have any semantic identifier, and they’re both a blessing and a curse. In practice, most serializations need locally-scoped labels to actually represent them; in JSON-LD, every JSON object that doesn’t have an explicit "@id" property is implicitly interpreted as a new, distinct blank node. So the JSON-LD document

{
  "http://schema.org/name": "John Doe"
}

is equivalent to the N-Triples file

_:foo <http://schema.org/name> "John Doe" .

and they both encode the same graph: a single blank node with one schema.org/name property.

The _:foo label in the N-Triples file only exists to distinguish it from other blank nodes in the same dataset. It’s not expected to have any relation to other _:foo nodes in other datasets.

Uses

Blank nodes are useful for representing data structures that don’t fit into a graph model well, like linked lists, where the elements must be ordered:

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> "fee" .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> _:b1 .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> "fi" .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> _:b2 .
_:b2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> "fo" .
_:b2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> _:b3 .
_:b3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> "fum" .
_:b3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .

We don’t want to come up with URIs for every link in our linked list, so we use blank nodes to structure our data without them.
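To make the shape concrete, here’s a minimal sketch (plain JavaScript; the `listToNQuads` helper is hypothetical) that serializes an array of literals as exactly this kind of linked list, minting only local blank node labels:

```javascript
// Sketch: serialize an array of string literals as an RDF linked list
// (rdf:first / rdf:rest chain) using generated blank node labels.
const RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

function listToNQuads(items) {
  const lines = [];
  items.forEach((item, i) => {
    // The last link's rest points at rdf:nil; every other link points
    // at the next blank node in the chain.
    const rest = i + 1 < items.length ? `_:b${i + 1}` : `<${RDF}nil>`;
    lines.push(`_:b${i} <${RDF}first> "${item}" .`);
    lines.push(`_:b${i} <${RDF}rest> ${rest} .`);
  });
  return lines.join("\n");
}

console.log(listToNQuads(["fee", "fi", "fo", "fum"]));
```

Running this reproduces the eight triples above, without a single URI minted for the list structure itself.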

Motivation

But sometimes we want URIs for blank nodes, even though they were invented to avoid that! Suppose you came across the graph in the first example:

Maybe someone emailed it to you, or you found it on the ground, or a USB stick fell out of the sky. And suppose you wanted to reference the blank node in a graph of your own:

Hey, I think I know that guy! Here are some more facts about him…

There’s no clear way to do this! Ideally we’d be able to eat our cake and have it too: blank nodes should be a convenience for authors, but also shouldn’t prevent future authors from linking to them unambiguously. But how?

Skolemization

The RDF spec has a section on “skolemization”, which is what they call the process of replacing blank nodes with globally unique URIs:

In situations where stronger identification is needed, systems MAY systematically replace some or all of the blank nodes in an RDF graph with IRIs. Systems wishing to do this SHOULD mint a new, globally unique IRI (a Skolem IRI) for each blank node so replaced.

The spec prescribes minting well-known URIs with the registered name genid to get URIs that look like this:

http://example.com/.well-known/genid/d26a2d0e98334696f4ad70a677abc1f6

It’s hard to tell exactly what kind of situation is meant by the vague reference to “systems” in the spec. This is probably intentional - the authors don’t want to make assumptions or prescriptions about how RDF data will be used - but some centralized entity will have to do the work of choosing a domain name and generating the unique identifiers. This is an unfortunate fallback for a data model designed for the open web!

Problems with skolemization

Consider a prototypical user story: a library publishes an RDF dataset describing their catalog, and uses linked lists to represent the sequence of books in a series. One book series was adapted into a movie, and later, a movie database application wishes to reference one of the library’s linked lists in an RDF dataset of their own. Unless the library anticipated this need and generated well-known URIs to replace their blank nodes, the movie application has no way of referencing them.

On the other hand, if the movie application generates well-known URIs for the library’s blank nodes, then there’s no effective link to the library’s dataset at all. Out of context, the well-known URI is no more useful than another blank node, since it leaves a user with no way to find out more about its referent from other sources, which is the fundamental premise of linked data.

Indexing datasets

Since blank nodes are strictly “scoped” to a dataset, the problem of identifying blank nodes reduces to the problem of identifying datasets. Given a URI for a dataset in a serialization that assigns local labels to every blank node, we could address blank nodes using fragment identifiers.

For example, if everybody in the world knew that the URI http://example.com/a-very-special-dataset referred to the serialized dataset

_:foo <http://schema.org/name> "John Doe" .

… then we could globally identify the blank node labelled _:foo with the URI http://example.com/a-very-special-dataset#_:foo. It’s important that the specific serialization is the thing identified by the root URI, since the serialization

_:bar <http://schema.org/name> "John Doe" .

expresses the same abstract graph using different blank node labels. But if we have a URI for the specific serialization, then we’re free to use the labels in the fragment identifier without ambiguity.

This interpretation of fragment identifiers isn’t prescribed in the N-Quads spec but is compatible with the fragment identifier semantics in RFC 3986:

The fragment identifier component of a URI allows indirect identification of a secondary resource by reference to a primary resource and additional identifying information. The identified secondary resource may be some portion or subset of the primary resource, some view on representations of the primary resource, or some other resource defined or described by those representations.

… and is more explicitly encouraged by Tim Berners-Lee’s personal view:

The fragment identifier on an RDF (or N3) document identifies not a part of the document, but whatever thing, abstract or concrete, animate or inanimate, the document describes as having that identifier.

And this fits with all of our intuitions about fragment identifiers on the web. When browsers fetch a web page, they don’t send the URL fragment to the server: they request the whole page, and then use the fragment to index a particular DOM element within the document. Similarly, blank node labels are not identifiers so much as indices into documents - they require the context of the entire dataset.
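Under this reading, resolving a fragment is just a lookup against the retrieved document. A sketch (plain JavaScript; `resolveFragment` is a hypothetical helper) that returns the statements mentioning a given blank node label:

```javascript
// Sketch: treat a fragment identifier like "_:foo" as an index into a
// serialized N-Quads dataset, returning the statements that mention it.
// Like a browser resolving a DOM fragment, this needs the whole
// document in hand first.
function resolveFragment(nquads, fragment) {
  return nquads
    .split("\n")
    .filter(line => line.split(/\s+/).includes(fragment));
}

const dataset = '_:foo <http://schema.org/name> "John Doe" .';
console.log(resolveFragment(dataset, "_:foo"));
// → [ '_:foo <http://schema.org/name> "John Doe" .' ]
```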

URIs for datasets

Sometimes a dataset has a clear URI that is used to refer to it. Other times a dataset is available at an HTTP URL, but the URL is a mirror, and there is some other canonical URL that would be more appropriate to use as an identifier. And in some situations there may be no obvious URI at all, like the case of finding a dataset on a USB stick or attaching one to an email.

This may seem contrived, but decoupling data from its host is critical for its long-term utility. We can’t depend on the persistence of URLs, servers, or even organizations, and tying the usability (however marginal) of a dataset to their stability would be irresponsible.

Instead, the future-facing, archive-friendly, durability-first vision is to treat the container of RDF data as completely self-contained. This is not to say that there aren’t links to other objects - but rather that the interpretation of the data isn’t dependent on the runtime results of attempts to dereference URLs. Datasets should speak for themselves!

There are many facets to this vision (such as linked data signatures for in-band self-authentication to replace the web’s origin-based authority model), but the one in focus here is identity. We’d ideally like a way of identifying a dataset that is independent of its host and even independent of its serialization: derived purely from its semantic content.

This is possible through the combination of two building blocks:

  • A canonicalization algorithm that produces a deterministic normalized serialization of any dataset, such that isomorphic datasets produce identical serializations

  • A content-addressing scheme that assigns a unique, deterministic URI to any serialization (i.e. file)

Canonicalization

The former is by far the harder of the two, but fortunately the W3C Credentials Community Group has put an enormous amount of effort into publishing an RDF Dataset Normalization spec which achieves exactly that. In particular, the URDNA2015 canonicalization algorithm renames blank nodes with deterministic labels, derived from the structure of the graph and a lexicographic comparison of the URIs and literals.

For example, the N-Quads documents

_:foo <http://schema.org/name> "John Doe" .
_:foo <http://schema.org/jobTitle> "Firefighter" .
_:bar <http://schema.org/name> "Jane Doe" .
_:bar <http://schema.org/jobTitle> "Professor" .
_:bar <http://schema.org/knows> _:foo .

and

_:jane <http://schema.org/jobTitle> "Professor" .
_:jane <http://schema.org/knows> _:john .
_:jane <http://schema.org/name> "Jane Doe" .
_:john <http://schema.org/jobTitle> "Firefighter" .
_:john <http://schema.org/name> "John Doe" .

are equivalent, and both would normalize to the canonical N-Quads document

_:c14n0 <http://schema.org/jobTitle> "Firefighter" .
_:c14n0 <http://schema.org/name> "John Doe" .
_:c14n1 <http://schema.org/jobTitle> "Professor" .
_:c14n1 <http://schema.org/knows> _:c14n0 .
_:c14n1 <http://schema.org/name> "Jane Doe" .

Content-addressing

The simplest scheme that assigns a unique, deterministic URI to every file is one that just encodes the entire file inside of the URI:

data:application/n-quads;base64,XzpjMTRuMCA8aHR0cDovL3NjaGVtYS5vcmcvam9iVGl0bGU+ICJGaXJlZmlnaHRlciIgLg0KXzpjMTRuMCA8aHR0cDovL3NjaGVtYS5vcmcvbmFtZT4gIkpvaG4gRG9lIiAuDQpfOmMxNG4xIDxodHRwOi8vc2NoZW1hLm9yZy9qb2JUaXRsZT4gIlByb2Zlc3NvciIgLg0KXzpjMTRuMSA8aHR0cDovL3NjaGVtYS5vcmcva25vd3M+IF86YzE0bjAgLg0KXzpjMTRuMSA8aHR0cDovL3NjaGVtYS5vcmcvbmFtZT4gIkphbmUgRG9lIiAuDQo=

This is obviously unreasonable, but notice that beyond uniqueness and determinism, this URI scheme also gives us a built-in way of recovering the content of the dataset from the URI. This is the same result as dereferencing an HTTP URL! Not only do we get a decentralized way of all agreeing on the same identifier for a dataset and all of its blank nodes (no matter how or where we came across it), but we also get an automatic way of retrieving the content of the dataset from that identifier.
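To make this concrete, here’s the round trip in a few lines of Node (a sketch of the idea, not a serious proposal):

```javascript
// Sketch: content-address a canonical N-Quads document by encoding it
// whole into a data URI, then recover the document from the URI.
const doc = '_:c14n0 <http://schema.org/name> "John Doe" .\n';

const uri =
  "data:application/n-quads;base64," +
  Buffer.from(doc, "utf8").toString("base64");

// The URI is unique and deterministic for this serialization, and the
// content is recoverable by decoding the base64 payload back out.
const recovered = Buffer.from(uri.split(",")[1], "base64").toString("utf8");
console.log(recovered === doc); // → true
```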

We can expect datasets to be large, and data URIs are too long. However, we can turn unique long things into unique short things by hashing them! Two approaches are described here, each with their own tradeoffs.

CIDs and IPFS

IPFS is a decentralized filesystem that uses a content-hash naming scheme called Content Identifiers, or CIDs, to address files. CIDs are hashes prefixed by a few bytes of metadata, like the base encoding of the CID itself, the hash algorithm used, and the format of the data that was hashed. These self-describing metadata bytes are called Multiformats and a few of them are IETF drafts.

The IPFS network then acts like a giant name resolver for these CIDs, using a distributed hash table to associate CIDs with the network addresses of nodes who advertise that they’re hosting the associated file. This lets users request files from the network at large, without knowing the network address of any particular host.

URIs for files on IPFS use the dweb scheme, e.g. dweb:/ipfs/Qm….

Usage

A typical use case for someone running an IPFS daemon in the background might look like this:

$ cat data.jsonld
{
  "@context": {
    "@vocab": "http://schema.org/"
  },
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "knows": {
    "name": "John Doe",
    "jobTitle": "Firefighter"
  }
}
$ cat data.jsonld | jsonld normalize | ipfs add -Q
QmWGLNcTLXpbPPcEjjRx7PrzM4SDeKSNNDcmrHBZkx2X2Q

Then others can later refer to Jane with the canonical URI

dweb:/ipfs/QmWGLNcTLXpbPPcEjjRx7PrzM4SDeKSNNDcmrHBZkx2X2Q#_:c14n1

… which gives anyone who sees it the ability to dereference QmWGLNcTLXpbPPcEjjRx7PrzM4SDeKSNNDcmrHBZkx2X2Q into a real dataset, as long as somebody in the world is pinning it to their IPFS node.

$ ipfs cat QmWGLNcTLXpbPPcEjjRx7PrzM4SDeKSNNDcmrHBZkx2X2Q
_:c14n0 <http://schema.org/jobTitle> "Firefighter" .
_:c14n0 <http://schema.org/name> "John Doe" .
_:c14n1 <http://schema.org/jobTitle> "Professor" .
_:c14n1 <http://schema.org/knows> _:c14n0 .
_:c14n1 <http://schema.org/name> "Jane Doe" .

Of course, this isn’t a guarantee that it’ll resolve, but neither are HTTP URLs. IPFS at least lets content move between hosts and leverage mirrors without breaking addresses.

Drawbacks

One drawback to using CIDs to identify canonical datasets is that just given an IPFS URI dweb:/ipfs/Qm…#_:c14n0, there’s no immediate way to tell if the referenced hash is an N-Quads dataset or some other kind of file. We might be using IPFS URIs to point to PDFs or CSVs, and CIDs don’t have any way of encoding MIME type.

This is even mentioned as a feature of skolemization in the RDF syntax spec:

Systems may wish to mint Skolem IRIs in such a way that they can recognize the IRIs as having been introduced solely to replace blank nodes. This allows a system to map IRIs back to blank nodes if needed.

Whether this is an actual limitation depends on usage. Would there be systems that want to distinguish these “blank node URIs” from regular URIs - maybe fetching the datasets containing referenced blank nodes but ignoring other files? Maybe!

One way around this might be to use a URI scheme other than dweb specifically for canonicalized N-Quads CIDs, such as x:/ipfs/Qm…#_:c14n0. This is a radical act, but it might be justified if there isn’t much overlap in the way these “dataset links” and regular links are treated within a system.

Hashlinks

Another scheme for content-addressing data is described in the Cryptographic Hyperlink spec proposed by Manu Sporny last year, commonly called Hashlinks.

Hashlinks are very similar to CIDs and use mostly the same building blocks (multihash and multibase, but not multicodec) to generate a self-describing content hash of arbitrary files. The major difference is that instead of using a multicodec table to map the first few prefix bytes to a known “format”, Hashlinks support a whole extra URI element where users can encode arbitrary MIME type and user-defined metadata.

The metadata is also binary-packed (with CBOR) so that both the hash itself and the metadata appear as compact base58-encoded strings.

Usage

The spec is very new and not widely used, although it has been implemented as a JavaScript library for the browser and Node.js.

Using our example canonicalized dataset:

const hl = require("hashlink");
const jsonld = require("jsonld");

const data = {
  "@context": { "@vocab": "http://schema.org/" },
  name: "Jane Doe",
  jobTitle: "Professor",
  knows: {
    name: "John Doe",
    jobTitle: "Firefighter"
  }
};

const meta = { 'content-type': 'application/n-quads' };

jsonld
  .normalize(data)
  .then(data => hl.encode({ data, meta }))
  .then(uri => console.log(uri));

… which gives us another URI for Jane:

hl:zQmfVfBaKgFFkUtvnPjQ35khooRpbyrUK2wWq3SzvsKAavq:zkiKpvWP3HqVQEfLDhexQzHj4sN413x#_:c14n1

This is pretty verbose! But it does explicitly declare a MIME type that lets us tell it’s an RDF dataset “on sight”.

Drawbacks

Hashlinks are in an awkward position of doing a little too much - they’re positioned as an easy way to encode both a file’s hash and its associated HTTP URLs (see the spec for an example of URLs encoded in a Hashlink’s metadata). This is definitely convenient for use cases where a centralized authority publishes a dataset, but prevents two users who acquired the same dataset from different sources from independently generating the same identifier.

Using Hashlinks in a truly decentralized way means committing to a micro-standard on top of Hashlinks: no URLs, only an explicit content-type of application/n-quads. This has the small added benefit of eliminating a CBOR decoding step: the serialized metadata element of a “dataset hashlink” URI will always be zkiKpvWP3HqVQEfLDhexQzHj4sN413x.

And then there’s dereferencing - Hashlinks are only an identifier scheme, and have no relation to IPFS or any other kind of retrieval network. And since they use a slightly different encoding scheme than CIDs, they won’t look like the addresses IPFS generates and can’t be pasted into ipfs cat.

That said, it is technically possible to use the two together, although it requires a non-standard use of IPFS. Typically, when a file’s CID is computed during ipfs add, the actual bytes that get passed to the hash algorithm aren’t just the bytes of the file: the data first gets wrapped in a Protobuf struct (defined here) that allows “nodes” to both link to each other (i.e. directories) and carry byte data (i.e. files), and that serialized node struct is what gets hashed.

However, IPFS does allow adding “raw” files with the ipfs add --raw-leaves flag, so a revised usage of the ipfs CLI tool to create a Hashlink URI might look like:

$ cat data.jsonld 
{
  "@context": {
    "@vocab": "http://schema.org/"
  },
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "knows": {
    "name": "John Doe",
    "jobTitle": "Firefighter"
  }
}
$ cat data.jsonld | \
> jsonld normalize | \
> ipfs add -Q --raw-leaves | \
> ipfs cid format -f "%m" -b base58btc
zQmfVfBaKgFFkUtvnPjQ35khooRpbyrUK2wWq3SzvsKAavq

… which is identical to the hash element of the previous Hashlink example. Subsequent retrieval over IPFS in a JavaScript application might look like:

const CID = require("cids");
const multibase = require('multibase');
const jsonld = require("jsonld");
const ipfs = require("ipfs-http-client")();

const id = "zQmfVfBaKgFFkUtvnPjQ35khooRpbyrUK2wWq3SzvsKAavq";
const hash = multibase.decode(id);
const cid = new CID(1, "raw", hash);
console.log(cid.toString());

const context = { "@vocab": "http://schema.org/" };

ipfs.cat(cid)
  .then(file => jsonld.fromRDF(file.toString(), { format: "application/n-quads" }))
  .then(doc => jsonld.compact(doc, context))
  .then(doc => console.log(doc));

… which gives us our original data back:

{
  "@context": { "@vocab": "http://schema.org/" },
  "@graph": [
    {
      "@id": "_:c14n0",
      "jobTitle": "Firefighter",
      "name": "John Doe"
    },
    {
      "@id": "_:c14n1",
      "jobTitle": "Professor",
      "knows": { "@id": "_:c14n0" },
      "name": "Jane Doe"
    }
  ]
}

Bonus bait

Our original motivation here was to find a scheme for addressing blank nodes, but the practice of content-addressing canonicalized datasets is also more broadly useful.

Datasets as objects

The most obvious application is using dataset identifiers in RDF data to make statements about datasets as digital objects themselves. This could be used to make post-hoc claims about provenance, veracity, correction, or general annotation by the publisher or any third party.

Graphs as objects

Using blank node labels in fragment identifiers can be extended to index blank graph names as well - particularly convenient is the practice of using the empty fragment (e.g. dweb:/ipfs/Qm…# or hl:zQmf…:zkiK…#) to refer to the default graph.

Statements as objects

Lastly, since blank node labels in N-Quads always begin with _:, other forms of indexing could also use fragment identifiers without introducing ambiguity. For example, the historically most-difficult RDF construct to identify within RDF data is the RDF Statement (aka Triple aka Quad) itself, and a lot of effort has been sunk into proposing different methods of reification. But since URDNA2015 sorts each quad of the dataset into a normalized order, each statement can be uniquely and concisely identified by its integer line number in the serialized file, like x:/ipfs/Qm…#/81 or hl:zQmf…:zkiK…#/16.
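Resolving such a statement index is then trivial (a sketch; the `/81`-style fragment syntax is the author’s proposal here, not a standard, and whether indices are zero- or one-based would need to be pinned down):

```javascript
// Sketch: resolve a fragment like "/1" against a canonical N-Quads
// document by treating it as a line index (zero-based here, by
// assumption) into the sorted, deterministic serialization.
function resolveStatement(canonicalNQuads, fragment) {
  const index = parseInt(fragment.slice(1), 10);
  return canonicalNQuads.split("\n")[index];
}

const doc = [
  '_:c14n0 <http://schema.org/jobTitle> "Firefighter" .',
  '_:c14n0 <http://schema.org/name> "John Doe" .'
].join("\n");

console.log(resolveStatement(doc, "/1"));
// → _:c14n0 <http://schema.org/name> "John Doe" .
```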

This means that datasets can’t refer to their own statements, but this could also be considered a feature: you shouldn’t be able to reference a statement until it has, in fact, been stated. Content-addressing permanently binds statement URIs to the context of a dataset that may further (and significantly!) qualify their meaning.

Conclusions

The schemes presented here for addressing in RDF are specific applications of a general paradigm that marries elements of the decentralized web and semantic web movements. In particular, treating the RDF Dataset as an immutable container of graph data is fertile ground for new approaches to some of the semantic web’s long-standing pain points, like reification, dereferencing, archiving, and link rot.

Immutability brings its own set of challenges, especially around versioning and developer tooling. N-Quads are space-inefficient! Canonicalization is as hard as graph isomorphism! How do we make it easy for authors to update the datasets they publish, and how do we publish those updates? These are tough problems, but solving them buys us a more robust, permanent, and self-describing future.

Comments
Samuel Klein: How can we make these more robust to trivial changes? A mechanism for describing or hinting at equivalence classes would strengthen the brittle aspect of {random hash functions} + {pure content identification}.
Joel Gustafson: We definitely need infrastructure and tooling that help alleviate the perceived burden / significance / “weight” of publishing small edits (e.g. pubpub’s struggle with the word “publish”), but at the end of the day there’s no telling whether a change is actually trivial. ML might be an interesting tack to take but it needs to live on a level above content-addressing.