Johannes Ernst's Blog

What is a good conceptual model for identity data?

Here are some of the proposals I have heard recently:

  • name-value pairs
  • name-value pairs in a hierarchical arrangement (think LDAP)
  • the relational model (think SQL)
  • XML
  • RDF and/or RDF schema
  • OWL
  • UML and variants

and I'm sure I'm forgetting a few.

If this sounds like every possible model in the known universe of data representation, that's probably because it is...

When faced with a situation like this (which has nothing to do with digital identity specifically; it is just good information modeling practice), I tend to first determine whether some of the proposed approaches ("meta-models" would be the more correct term) are inherently more powerful than others. Then I use the most powerful one to express the information that needs expressing, and afterwards I check whether a simpler approach would have sufficed. Let's see where that leads us.

Name-value pairs are clearly the simplest of the lot. Hierarchical name-value pairs are a superset, as is XML. Both of those (without additional conventions) have an inherently hierarchical structure, and it is hard to model general-purpose graphs in them, as RDF/RDFS, OWL, UML and others can, so the latter are more powerful. RDF schema, OWL and UML are in turn more powerful than RDF by itself: not only can they express graphs, as RDF can, but they also have additional concepts such as inheritance and the distinction between entities and attributes. So those three are similarly powerful (and the distinctions between them are largely irrelevant for our purposes here).

So let's take an example set of identity data and model it accordingly. Let's say I want to model Joaquin's and my business card information. We need to capture name, company, job title, and business and private phone numbers. With a UML-ish mindset, we'd find the following concepts:

  • Person, with attributes such as Given and Last Name
  • Company (which is probably a special case of a more general concept called Organization) with attributes such as LegalName etc.
  • Role, with attributes such as Name
  • VoiceCommunicationsEndPoint (or whatever we want to call this), with an attribute called PhoneNumber
  • Some others which we'll ignore for this example in the interest of brevity
  • associations between them.

For the above example, we get two instances of Person (Joaquin and Johannes). We get one instance of Company (NetMesh). There are two instances of Role (CEO, Architect), and a number of instances of VoiceCommunicationsEndPoint (for the various phone numbers we are using). There are relationships between Person Joaquin and Role Architect, between Role Architect and Company NetMesh, between Person Johannes and Role CEO, and between Role CEO and Company NetMesh.

The associations between the VoiceCommunicationsEndPoint and the other objects are interesting. My private phone number is clearly an association between Person Johannes and the VoiceCommunicationsEndPoint, but my business phone number is most likely best represented as an association between VoiceCommunicationsEndPoint and Role. This is because my use of this VoiceCommunicationsEndPoint is contingent upon my being in the role at NetMesh that I'm in; if I were fired, my use of the VoiceCommunicationsEndPoint would go away, too.
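To make the model above a bit more tangible, here is a minimal sketch in Python. The class layout is only one possible reading of the concepts listed above, the phone numbers are made-up placeholders, and the method name reachable_numbers is invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VoiceCommunicationsEndPoint:
    phone_number: str


@dataclass
class Company:
    legal_name: str


@dataclass
class Role:
    name: str
    company: Company  # association Role -> Company
    # Business numbers hang off the role, not off the person.
    end_points: List[VoiceCommunicationsEndPoint] = field(default_factory=list)


@dataclass
class Person:
    given_name: str
    last_name: str = ""  # left empty where the post does not give one
    roles: List[Role] = field(default_factory=list)
    # Private numbers are associated with the person directly.
    end_points: List[VoiceCommunicationsEndPoint] = field(default_factory=list)

    def reachable_numbers(self) -> List[str]:
        """Private numbers first, then numbers held through current roles."""
        numbers = [e.phone_number for e in self.end_points]
        for role in self.roles:
            numbers += [e.phone_number for e in role.end_points]
        return numbers


# Placeholder phone numbers, invented for this example.
netmesh = Company("NetMesh Inc.")
ceo = Role("CEO", netmesh, [VoiceCommunicationsEndPoint("+1-555-0100")])
architect = Role("Architect", netmesh)

johannes = Person("Johannes", "Ernst", roles=[ceo],
                  end_points=[VoiceCommunicationsEndPoint("+1-555-0199")])
joaquin = Person("Joaquin", roles=[architect])

before = johannes.reachable_numbers()

# Losing the role takes the business number with it; the private one stays.
johannes.roles.remove(ceo)
after = johannes.reachable_numbers()
```

Because both roles point at the same Company object, there is exactly one NetMesh in the model, and the business number disappears from Johannes's reachable numbers as soon as the role association is removed.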

This is a very tiny example, but it serves as a good illustration of how quickly object relationships can become complex in an identity scenario, and how rich a vocabulary one would like to use.

So let's see how we would represent the same information using less powerful techniques than UML / OWL etc.:

  • It's clear that using RDF wouldn't be much of a problem. One can clearly express this information as a graph, constructed from a set of triples. It loses the distinction between attributes and classes, but with a well-managed schema, that is not really much of a problem (tool support, verbosity and so on are other issues, but we won't deal with those right now).
  • plain XML is a bit harder. In our model, we have one instance of Company, which is related to two people. This can only be expressed in the hierarchical XML structure if we put the company at the top (as LDAP does). Unfortunately, this structure breaks down as soon as one person has a relationship to more than one company: we'd have to construct a workaround to have two places in two XML files be declared to be synonyms (either representing the person, or the company, depending on whether we put the company or the person at the root of the hierarchy). This is one of the reasons directories are usually only applied to employees of a single organization, by the way.

    Fortunately, there is a trick that can get us out of this particular problem: if we use URLs/URIs (which could be identity-enabled, by the way) in place of in-lined company or person information, it is easy to determine that two nodes in two different XML files represent the same person, or company, and it is easy to look up information about it simply by dereferencing that URL.

    So we can probably make plain XML work.

  • But name-value pairs become really, really ugly, just like in my previous example. It's easy to come up with name-value pairs for a Person, or a Company, or a Role, or any of the classes. But it is very hard, if not impossible, to represent the associations. How do you say, for example, using name-value pairs, that Johannes works in the role of CEO for NetMesh, and as part of this, may use this phone number? At the very least, we'd have to have the ability for properties to carry two values, not just one (so we could represent an association between, say, Johannes the Person and CEO the Role).

    If we don't have these two pointers, and absent, say, reconstructing RDF in name-value syntax, we can't represent the information in the richness that is inherent even in this very simple example. Instead, we'd have to fudge, such as: we don't allow the person to work for multiple companies, or we just call it "work phone" and don't relate it to the role, or we leave it as an exercise to the reader to determine that if Joaquin and Johannes both work for a company whose name is "NetMesh Inc.", it must be the same company. None of which is a good foundation to build trustworthy stuff on ... which is why I'm calling it a dead end.
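The contrast between the triple-based and the name-value approach can be sketched in a few lines of Python. This is only an illustration: the identifiers (person:johannes, role:ceo, ...), the predicate names and the keys of the flat cards are all invented here, and plain tuples stand in for real RDF.

```python
# Each triple is (subject, predicate, object). All identifiers and
# predicate names are made up for this sketch.
triples = {
    ("person:johannes", "hasRole", "role:ceo"),
    ("person:joaquin", "hasRole", "role:architect"),
    ("role:ceo", "atCompany", "company:netmesh"),
    ("role:architect", "atCompany", "company:netmesh"),
    ("role:ceo", "mayUse", "phone:business-1"),
    ("person:johannes", "mayUse", "phone:private-1"),
    ("company:netmesh", "legalName", "NetMesh Inc."),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def phones_for(person):
    """Phones used directly, plus phones used by way of a role."""
    phones = set(objects(person, "mayUse"))
    for role in objects(person, "hasRole"):
        phones.update(objects(role, "mayUse"))
    return sorted(phones)


def companies_for(person):
    """Companies reached through the person's roles."""
    return sorted({c for r in objects(person, "hasRole")
                   for c in objects(r, "atCompany")})


# The same business cards as flat name-value pairs: one value per name.
johannes_flat = {"given_name": "Johannes", "company": "NetMesh Inc.",
                 "title": "CEO", "work_phone": "phone:business-1"}
joaquin_flat = {"given_name": "Joaquin", "company": "NetMesh Inc.",
                "title": "Architect"}

# In the graph, both people reach the very same company node.
same_company_graph = (companies_for("person:johannes")
                      == companies_for("person:joaquin"))

# In the flat cards, the best we can do is compare two strings and hope.
same_company_guess = johannes_flat["company"] == joaquin_flat["company"]
```

In the triple set, the business phone is attached to the role, so removing the hasRole triple would also cut Johannes off from it. In the flat cards, "work_phone" is just another property of the person, and nothing ties it to the role or the company.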

So my conclusion, which I'm sure you aren't surprised about given what we've done in LID and InfoGrid, is that we can probably scrape by with XML, but not with anything simpler. InfoGrid itself (the platform, of which identity is only one component) is based on a richer approach that's closer to UML and OWL, and with examples like this one, it's easy to understand why. For InfoGrid LID, the identity layer, we chose XML, and very consciously so, although this is the first time that the reasoning has been written down ...
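To illustrate the URL trick that makes plain XML workable, here is a small Python sketch. The element names, attribute names and example.com URLs are invented for this illustration; this is not the actual LID format.

```python
import xml.etree.ElementTree as ET

# Two separate XML documents, one per person. Instead of in-lining the
# company, each role points at it by URL (hypothetical URLs).
JOHANNES_XML = """
<person id="http://example.com/people/johannes">
  <role name="CEO" company="http://example.com/companies/netmesh"/>
</person>
""".strip()

JOAQUIN_XML = """
<person id="http://example.com/people/joaquin">
  <role name="Architect" company="http://example.com/companies/netmesh"/>
</person>
""".strip()


def company_urls(xml_text):
    """Collect the company URLs referenced by the roles in one document."""
    root = ET.fromstring(xml_text)
    return {role.get("company") for role in root.iter("role")}


# Two nodes in two different files denote the same company exactly when
# their URLs are equal; no synonym declarations are needed.
shared = company_urls(JOHANNES_XML) & company_urls(JOAQUIN_XML)
```

A person with roles at several companies simply carries several role elements with different company URLs, so the hierarchy no longer forces the choice of putting either the company or the person at the root.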
