Whilst writing another article about loosely coupled systems and their data, I was struck by one internal reviewer’s comments about one uncontentious (at least to me) statement I made:
“Why store variables of state in a traditional relational database when we could use semantics and store each of them as triples in a semantic database - using an abstract GUID (or, better still, a DID) as the subject?”
The loosely coupled systems article was already getting a bit long, so I couldn’t go into any depth there. The loose end of the idea was still dangling, waiting for me to pull on it. So I pulled and… well, here we are in another article.
I’ve always liked origins. I love it when someone crystallises an idea in a “Eureka!” moment, especially if, at the time, they have no clue about the wide-ranging impact that idea will have. Think Darwin’s “On the Origin of Species” (1859) or Mary Wollstonecraft’s “A Vindication of the Rights of Woman” (1792). A particular favourite of mine is the “Planck Postulate” (1900), which led to quantum mechanics.
The origin story that’s relevant to all this is Tim Berners-Lee’s (TBL) “Vague but exciting…” diagram (1989), which was the origin of the World Wide Web. I want you to take another look at it. (Or, if it’s your first look, be prepared to be awed by how much influence that sketch has had on humanity.) There are a few things in that diagram that I want to highlight as important to this article:
…but don’t worry, we’re not going to get into the weeds just yet.
I want to introduce a use case and our main character. He’s not real; he’s an archetype. Let’s call him “Geoff”. Geoff is an example of a person in a vulnerable situation: someone who could benefit from the Priority Services Register (PSR). Geoff lives alone and has the classic three health problems that affect our increasingly ageing population: Chronic obstructive pulmonary disease (COPD), Type-2 Diabetes and Dementia.
Geoff’s known to a lot of agencies. The Health Service have records about him, as do the Police, Utilities (Gas, Water, Electricity), The Post Office, Local Government and, as he has dementia, his local Supermarket. They have a collective responsibility to ensure Geoff has a healthy and fulfilling life. Executing on that responsibility is going to mean sharing data about him.
Now we’re set in the arc of our story. We’ve got the four characteristics of this kind of problem:
Overarching cooperative goal
To expand: Data about Geoff is in various databases in different forms owned by assorted agencies. Each of those agencies has its own classifications for vulnerabilities, goals, aims and agenda - targets and KPIs to which they have to adhere - but remember they also have a joint responsibility to make sure Geoff is ok.
In this article, I’d like to weave the three aspects of TBL’s diagram with the four characteristics of the problem space and see if we can “Solve for Geoff” and improve his experience.
Let’s start with information management in distributed systems. The understanding of “Distributed Systems” has moved on since 1989. What we’re talking about here is a “Decentralised” system. There’s not one place (in the centre) where we can put data about Geoff. Everyone has some information about him and we need to manage and share that information for the good of Geoff.
If we imagine a couple of separate relational databases that have rows of data about Geoff, we’ll see there are two problems.
Two different versions of Geoff
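As a minimal sketch of the same idea in code, here are two rows about Geoff as two systems might hold them. Only the keys 1234 and 4321 come from this article; every column name and value (PersonID, cust_no, psr_flag, the date of birth and so on) is a hypothetical stand-in.

```python
# Hypothetical rows: the column names and values are invented for illustration.
# Only the two keys (1234 and 4321) are taken from the article.
health_row = {
    "PersonID": 1234,
    "FirstName": "Geoff",
    "DateOfBirth": "1950-01-01",
    "Vulnerable": True,
}
utility_row = {
    "cust_no": 4321,
    "forename": "Geoff",
    "dob": "1950-01-01",
    "psr_flag": "Y",
}

# Column names the two systems have in common (there are none).
shared_columns = set(health_row) & set(utility_row)

# Whether the two systems use the same key for Geoff (they don't).
same_key = health_row["PersonID"] == utility_row["cust_no"]

print(shared_columns)  # set()
print(same_key)        # False
```

Nothing in either row tells a computer that both describe the same person, or that "Vulnerable" and "psr_flag" might be related.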
Spotted them? They are:
1. The names of the columns are different
2. The identifying “key” data isn’t the same (Why would it be? They’re in different systems)
To generalise: 1. is about metadata - the data about the data; 2. is about identity.
So, to metadata. In a relational database there is some metadata, but not much, and it’s pretty hidden. You’ve probably heard of SQL, but perhaps not of the part of it called DDL (data definition language). DDL is what defines the tables and their structure, so the first example above would be something like:
Data Definition Language for the Persons table
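A minimal sketch of what such DDL could look like, run against an in-memory SQLite database from Python so it can be tried out. Only the table name Persons and the boolean Vulnerable column are taken from this article; the other columns, their types and the inserted values are assumptions.

```python
import sqlite3

# Hypothetical schema: "Persons" and the boolean "Vulnerable" column come from
# the article; the remaining columns and types are assumed for illustration.
DDL = """
CREATE TABLE Persons (
    PersonID    INTEGER PRIMARY KEY,
    FirstName   TEXT NOT NULL,
    DateOfBirth TEXT,
    Vulnerable  BOOLEAN NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
conn.execute(
    "INSERT INTO Persons VALUES (?, ?, ?, ?)",
    (1234, "Geoff", "1950-01-01", True),
)

# Ask the database for its metadata: all it returns here is names and declared types.
columns = [
    (name, col_type)
    for _cid, name, col_type, *_rest in conn.execute("PRAGMA table_info(Persons)")
]
print(columns)
```

The schema says that Vulnerable is a boolean, and nothing at all about what it means.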
What’s wrong with this? (I hear you ask). At least a couple of things. One is that there’s no description of what the terms mean. What does “Vulnerable” mean? And by defining it as a boolean, you’re either vulnerable or not. The other thing that’s very important in Geoff’s scenario is that this (incomplete and unhelpful) metadata is never exposed. You might get a CSV file with the column headings, and a Word document explaining to humans what they mean, but that’s about it. Good luck trying to get a computer to understand that autonomously...
A part of Tim Berners-Lee's "Vague but exciting" diagram
I haven’t forgotten TBL’s diagram. In it, he hints at another way of describing data: using a directed graph. A graph has nodes (the trapeziums in his diagram) and edges (the lines with words on them). The directed bit is that they’re arrows, not just lines. He’s saying that there are a couple of entities, This Document and Tim Berners-Lee, and that the entity called Tim Berners-Lee wrote the entity called This Document. (And, as it’s directed, This Document didn’t, and couldn’t, write Tim Berners-Lee.)
Skipping forward blithely over many developments in computer science, we arrive at the Resource Description Framework (RDF), which is a mechanism for expressing these graphs in a machine-readable way. RDF is made of individual “Triples”, each of which asserts a single statement. The name refers to the three parts: subject, predicate and object (SPO). To paraphrase, the above bit of graph would be written:
Subject | Predicate | Object
Tim Berners-Lee | Wrote | This document
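As a minimal sketch, that one triple can be held as a small data structure, and the directedness checked by asking the question both ways round. The names Triple and holds are hypothetical, chosen only for this illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement in a directed graph: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str


# The graph is just a set of triples; here it holds the single statement above.
graph = {Triple("Tim Berners-Lee", "wrote", "This document")}


def holds(subject: str, predicate: str, obj: str) -> bool:
    """Return True if exactly this statement is asserted in the graph."""
    return Triple(subject, predicate, obj) in graph


print(holds("Tim Berners-Lee", "wrote", "This document"))  # True
print(holds("This document", "wrote", "Tim Berners-Lee"))  # False
```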
We can translate Geoff’s data into RDF, too. The following is expressed in the “Terse RDF Triple Language” (TTL or “Turtle”), which is a nice compromise between human and machine readability.
RDF version of Geoff's data
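A minimal sketch of how such Turtle could be produced from a handful of triples, using plain Python and no RDF library. The terms foaf:Person and fibo:hasDateOfBirth and the <some_abstract_id> subject come from this article; the property values, the foaf:firstName triple and the helper name to_turtle are assumptions, and the FIBO namespace IRI should be checked against the FIBO ontology before relying on it.

```python
# Prefix -> namespace IRI. The fibo IRI is an assumption to be verified.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "fibo": "https://spec.edmcouncil.org/fibo/ontology/FND/AgentsAndPeople/People/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

# (predicate, object) pairs for one subject. The values are invented.
GEOFF_PROPERTIES = [
    ("a", "foaf:Person"),  # "a" is Turtle shorthand for rdf:type
    ("foaf:firstName", '"Geoff"'),
    ("fibo:hasDateOfBirth", '"1950-01-01"^^xsd:date'),
]


def to_turtle(prefixes, subject, properties):
    """Serialise one subject and its properties as a Turtle snippet."""
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in prefixes.items()]
    lines.append("")
    lines.append(subject)
    body = [f"    {predicate} {obj}" for predicate, obj in properties]
    # ";" separates properties of the same subject, "." ends the statement.
    lines.append(" ;\n".join(body) + " .")
    return "\n".join(lines)


turtle = to_turtle(PREFIXES, "<some_abstract_id>", GEOFF_PROPERTIES)
print(turtle)
```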
The things before the colons in the triples are called “prefixes” and, together with the bit after, they’re a shorthand way to refer to the definition of the property (like fibo:hasDateOfBirth) or class (like foaf:Person). Notice that I’ve hyperlinked the definitions. This is because all terms used in RDF should be uniquely referenceable (by a URL) somewhere in an ontology. Go on, click the links to see what I mean.
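The shorthand can be sketched as a simple string expansion: look up the prefix, then append the local name to get the full, linkable IRI. The function name expand is hypothetical; the FOAF namespace is the published one.

```python
# Prefix -> namespace IRI, as would be declared at the top of a Turtle file.
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}


def expand(prefixed_name: str, prefixes: dict) -> str:
    """Turn a prefixed name such as 'foaf:Person' into its full IRI."""
    prefix, colon, local_name = prefixed_name.partition(":")
    if not colon or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return prefixes[prefix] + local_name


print(expand("foaf:Person", PREFIXES))  # http://xmlns.com/foaf/0.1/Person
```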
We’ve now bumped into one of the ideas that spun out of TBL’s first diagram: the Semantic Web. He went on (with two others) to describe it further in their seminal paper of 2001. All the signs were there in that first diagram, as we’ve just seen. Since then, the Semantic Web has been codified in a number of standards, like RDF, with a query language, SPARQL, and myriad ontologies spanning multiple disciplines and domains of knowledge. The Semantic Web is often connected to the concept of “Linked Data”, in that you don’t have to possess the data: you can put a link to it in your data and let the WWW sort it out. foaf:Person from above is a small example of Linked Data - they’ve defined “Person” in a linkable way so I can use their definition by adding a link to it. We’ll get back to this in a bit.
There are so many great reasons for encoding data in RDF. Interoperability being the greatest, in my opinion. I can send that chunk of RDF above to anybody and they (and their computers) should be able to understand it unambiguously as it’s completely self-contained and self-describing (via the links).
There’s just no equivalent in relational or other databases:
Keeping the data and the metadata together is a game-changer for sharing data.
That’s dealt with the first of our two problems outlined before, i.e. metadata. Let’s move on to identity. In my (and a lot of people’s) opinion identity wasn’t really considered carefully enough at the beginning of the internet. I don’t blame them. It would have been hard to predict phishing, fake accounts and identity theft back in 1989.
I put <some_abstract_id> in the RDF example, above, on purpose. Mainly because RDF needs a “subject” for all the triples to refer to, but also because I wanted to discuss how hard it is to think of what that id/subject would be. In RDF terms it should be an IRI, as it should point to a uniquely identifiable thing on the Internet, but what should we use? I have quite a lot of identities on the internet. On LinkedIn, I’m https://www.linkedin.com/in/mnjwharton/ . On Twitter, I’m https://twitter.com/iotics_mark . In Geoff’s case, what identity should we use? He has two in my contrived example: “1234” and “4321” - neither of which has any meaning outside the context of its own database. I certainly can’t use them as my <some_abstract_id> as they’re not URLs or URIs.
To solve this problem, who we gonna call? Not Ghostbusters, but the W3C and their Decentralised Identifiers (DIDs). Caveat first. This isn’t the only way to solve identity problems, just my favourite. The first thing to know about DIDs is that they are self-sovereign. This is important in a decentralised environment like the internet. There is (rightly) no place I can go to set up my “internet id”. I can set up my own id, host it anywhere on the internet and, when it’s resolved (looked up in a database, for example), it will show you a short document. Here’s an example from the W3C spec - first the DID itself:
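A DID is a short string with three colon-separated parts: the scheme did, a method name, and a method-specific identifier. A minimal sketch of pulling one apart, using did:example:123456789abcdefghi, the illustrative identifier used in the W3C DID specification. The function name parse_did is hypothetical, and the check is deliberately looser than the full grammar in the spec.

```python
def parse_did(did: str) -> tuple:
    """Split a DID into (method, method_specific_id).

    A loose check only: the W3C grammar also restricts which characters
    each part may contain, which this sketch does not enforce.
    """
    # Unpacking raises ValueError if there are fewer than three parts.
    scheme, method, method_specific_id = did.split(":", 2)
    if scheme != "did" or not method or not method_specific_id:
        raise ValueError(f"not a DID: {did!r}")
    return method, method_specific_id


print(parse_did("did:example:123456789abcdefghi"))
# ('example', '123456789abcdefghi')
```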
And then the document to which it points
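A simplified sketch of such a document, loosely modelled on the examples in the W3C DID specification rather than copied from it. The key value is a placeholder and not a real public key, and the helper name authentication_keys is hypothetical. The code loads the JSON and finds the key material that the subject can use to prove control of the id.

```python
import json

# Illustrative DID document; the publicKeyMultibase value is a placeholder.
DID_DOCUMENT = json.loads("""
{
  "@context": [
    "https://www.w3.org/ns/did/v1",
    "https://w3id.org/security/suites/ed25519-2020/v1"
  ],
  "id": "did:example:123456789abcdefghi",
  "verificationMethod": [{
    "id": "did:example:123456789abcdefghi#keys-1",
    "type": "Ed25519VerificationKey2020",
    "controller": "did:example:123456789abcdefghi",
    "publicKeyMultibase": "z-PLACEHOLDER-NOT-A-REAL-KEY"
  }],
  "authentication": ["did:example:123456789abcdefghi#keys-1"]
}
""")


def authentication_keys(doc: dict) -> list:
    """Return the verification methods usable for authentication.

    Entries under "authentication" may be embedded objects or string
    references to entries under "verificationMethod"; both are handled.
    """
    by_id = {method["id"]: method for method in doc.get("verificationMethod", [])}
    keys = []
    for entry in doc.get("authentication", []):
        method = by_id.get(entry) if isinstance(entry, str) else entry
        if method is not None:
            keys.append(method)
    return keys


for key in authentication_keys(DID_DOCUMENT):
    print(key["id"], key["type"])
```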
I agree that it looks pretty complicated, but it isn’t really for regular humans. The important thing is that I can prove, cryptographically, that this id is mine, as it has my public key and I can add proofs to it that only I can make (because only I have my private key). (Note for tech nerds: the document is in JSON-LD, a JSON-based serialisation of RDF.) These documents are stored in a Registry (which in itself should be decentralised, such as a blockchain or a decentralised file system such as IPFS).
Let’s get back to Geoff. The <some_abstract_id> I put in earlier can now be replaced by Geoff’s DID. I’ll make one up for him.
Then we can use an excellently-named technique called Identity “smooshing”, i.e. we can link all the other identities of Geoff that we know about using some triples. There are various predicates we could pick:
foaf:nick - someone’s nickname
skos:altLabel - an alternative label for something
But I think that gist:isIdentifiedBy from the Semantic Arts’ Gist ontology is the best idea for Geoff. gist:isIdentifiedBy describes itself as:
"This is like a URI: a thing can have more than one ID, but each of the IDs must refer to a unique thing."
Perfect! Especially the bit about being able to have more than one.
Putting all the bits together, using decentralised identifiers, semantics and linked data, we can have the self-sovereign id for Geoff linked to all his data and the other identifiers in other systems - all in one place and all self-contained and self-describing.
Full RDF version of Geoff with links to other systems’ information
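The smooshing step can be sketched with plain triples: one DID as the subject, and a gist:isIdentifiedBy triple for each identifier another system uses. This is a simplification under stated assumptions: the DID did:example:geoff, the two example.* IRIs wrapping the keys 1234 and 4321, and the helper names are all made up for illustration, and in the Gist ontology itself the identifier is modelled as its own node rather than a bare IRI.

```python
GEOFF_DID = "did:example:geoff"  # hypothetical DID, invented for this sketch

# (subject, predicate, object) triples. The example.* IRIs are assumptions
# that wrap the two database keys from the article (1234 and 4321).
TRIPLES = [
    (GEOFF_DID, "rdf:type", "foaf:Person"),
    (GEOFF_DID, "gist:isIdentifiedBy", "https://health.example/persons/1234"),
    (GEOFF_DID, "gist:isIdentifiedBy", "https://utility.example/customers/4321"),
]


def who_is(identifier: str, triples: list) -> set:
    """Return every subject that the given identifier identifies."""
    return {
        subject
        for subject, predicate, obj in triples
        if predicate == "gist:isIdentifiedBy" and obj == identifier
    }


def identifiers_of(subject: str, triples: list) -> list:
    """Return all known identifiers for a subject, sorted for stable output."""
    return sorted(
        obj
        for s, predicate, obj in triples
        if s == subject and predicate == "gist:isIdentifiedBy"
    )


# Either agency can start from its own key and arrive at the same Geoff.
print(who_is("https://health.example/persons/1234", TRIPLES))
print(who_is("https://utility.example/customers/4321", TRIPLES))
print(identifiers_of(GEOFF_DID, TRIPLES))
```

Each agency keeps its own key; the triples are what let any of them land on the same subject.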
Tying all the threads together to conclude: TBL’s visions of the World Wide Web and the Semantic Web were, and remain, decentralised to their core. Clue’s in the name. “Web”. His original diagram had all the pieces (except for Id) - information management in distributed (now decentralised) systems using graphs. TBL even tried to rename the WWW the Giant Global Graph (GGG) to emphasise this. Now most people just bundle these technologies as Web 3.0.
We also managed to “solve for Geoff” - the diverse, customer-in-a-vulnerable-situation use case - by allowing all the parties to keep data about Geoff:
In their own systems
Using their own identifiers
In an interoperable way (i.e. in RDF triples) so they can share some/all of it.
I think of this not as a standard as such, but a standard approach. It’s like we all agree about the alphabet to use, but we don’t care so much about what you write using it.
Decentralised problems call for decentralised solutions, and the mix of Semantics and Decentralisation allows everyone to keep control of their data about Geoff and to manage their part of the service mix, but also to share it with others in an interoperable way. At IOTICS, we call it “Digital Cooperation”.
I don’t really care what you call it: Semantics and Decentralisation go together like Fish and Chips, Beans on Toast, Strawberries and Cream. Strawberries are nice; cream is nice. But, together, they are more than the sum of their parts.