Johannes Ernst's Blog

Why name-value pairs are a bad idea for the exchange of identity information

Some months ago, Sxip took the rather dramatic step of moving the Sxip protocols away from XML as the basic structure of identity information exchange to simple name-value pairs. Ever since, I have been pondering whether that radical simplification is on the right side or the wrong side of "as simple as possible, but not simpler".

As a regular reader of this blog, you probably know that I'm a big fan of simple, and so questions like this one are rather important to me. If this could really work, it would be a huge simplification and A Great Thing.

Unfortunately, I've lately convinced myself that this is one step too far: the simplification won't work except in the simplest cases. While I'm a huge fan of building something very simple for the simplest cases first, and then adding more (optional) complexity as requirements grow, that approach does not work with name-value pairs. When things do become more complex, name-value pairs cannot simply be extended; they force a break and a different design. In other words, the approach doesn't scale towards more complexity, and that makes the name-value pair design a dead end for digital identity information. Unfortunately, because I do like simple.

Before anybody starts shouting, note that I'm not making this argument for competitive reasons: both friends and competitors have been flirting with this direction. Instead, I consider this a fundamental issue that we all should try to resolve collectively, and this is why I'm blogging this.

So let me try to make my case. (I know from past experience that I won't make it well in this first iteration, as hard as I may try. And I have been thinking about this for some months already! But bear with me; as you argue back, I will hopefully improve my explanation and my case ...)

Let's start with the big picture. I think nobody will really argue that in the general case, all the world's information, regardless of purpose, is best represented as name-value pairs. We have relational databases, for example, not name-value databases, and for good reasons. We have object graphs, and hyperlinks, for other good reasons, instead of just a two-column spreadsheet for names and values.

As a simple example for the general case, think of the set of all your ancestors, descendants and their relationships. One could probably devise a scheme by which all of this information could be packed into name-value pairs, such as:

father: Joe
mother: Jane
father.father: Jim
father.mother: Jill
...
father.father.father: Jack
...
father.father.son[2]: My paternal uncle

This becomes ridiculous rather quickly: imagine you have to express that your mother's aunt married your father's great-uncle. You'd have to create a construct such as:

mother.father.mother.daughter[2].married.father.father.mother.father.son[5]: true

(maybe there is a simpler way to represent this with name-value pairs that currently doesn't occur to me, but even if there is, I don't think it's going to be much simpler than this because we are really trying to shoehorn something that is a directed graph into name-value pairs.)
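To make the shoehorning concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the people other than Joe, Jane, Jim and Jack are made up, the `sister`/`brother` relations stand in for the indexed `daughter[2]`/`son[5]` names above, and `path_from` and `flatten_marriage` are hypothetical helpers, not part of any Sxip or LID specification. In the graph, the marriage is a single edge between two nodes; the long dotted name only appears once both nodes have to be described by their paths from "me".

```python
from collections import deque

# The family as a directed graph: (subject, relation, object) edges.
EDGES = [
    ("me", "mother", "Jane"),
    ("me", "father", "Joe"),
    ("Jane", "father", "Gus"),
    ("Gus", "sister", "Ann"),     # Ann is my mother's aunt (made-up person)
    ("Joe", "father", "Jim"),
    ("Jim", "father", "Jack"),
    ("Jack", "brother", "Bob"),   # Bob is my father's great-uncle (made-up person)
    ("Ann", "married", "Bob"),    # the fact we care about: one edge
]


def path_from(start, target, edges):
    """Breadth-first search for the dotted relation path from start to target."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return ".".join(path)
        for subject, relation, obj in edges:
            # Follow kinship edges only; "married" is the fact being stated.
            if subject == node and relation != "married" and obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [relation]))
    return None


def flatten_marriage(a, b, edges):
    """Build the one name-value pair that says 'a married b', relative to 'me'."""
    name = f"{path_from('me', a, edges)}.married.{path_from('me', b, edges)}"
    return name, "true"


name, value = flatten_marriage("Ann", "Bob", EDGES)
print(f"{name}: {value}")
```

The name that comes out is not stored anywhere in the graph; it is recomputed from two traversals, and it would change if the same fact were stated from somebody else's point of view.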

Note that while it is complex already, this example is rather academic; the world is much more complex than this because so much more information is related to people than just their ancestry. If we tried to represent that additional information, too, the syntax and conventions needed would have to be even more laborious... Exercise for the unconvinced reader: try to add their addresses, phone numbers, and past employers.

So if you agree with me that for the general (not specific to identity) case, name-value pairs are not a feasible way of representing many kinds of information, the question becomes:

While name-value pairs are not a feasible way of representing information in the general case, is the subset of information that is of interest in digital identity use cases simple enough that name-value pairs are sufficient?

This is the question I have been struggling with for some time ...

Well, of course I chose the ancestor-relatives example for a reason: it would be entirely conceivable that information of that kind plays a rather large role in many identity use cases. For example: "my great-uncle is a millionaire" or "I am the beneficiary of 62% of the trust set up by ... upon the death of ... who is currently 98 years old." or "I have not been fired for cause by my last N employers" or "none of the members or visitors in the N on-line communities that I frequent has ever made the statement that I spam". You can make up more examples of this kind: what's central to all of them is that we are talking about structured information.

Granted, these examples are not the type of identity information that people typically focus on today. But I'd argue that this is not because this kind of information isn't important for digital identity use cases, but because digital identity is in its infancy and we naturally deal with simple kinds of information first, such as street addresses, credit card numbers, date of birth, that kind of thing. But I'm quite certain that this kind of information is going to become relevant very quickly, and that it will produce much higher business value than, say, knowing the zip code of a person. For example: which one convinces you more that Joe has a good reputation: a street address where Joe lives, or the knowledge that Joe comes from a family of millionaires and industrialists that stretches back over four generations?

So complex, structured information that at its heart is best thought of as graphs is going to become very important for many digital identity use cases in the future, and we need to work under the assumption that such complex information is not going to be a weird corner case, but a high-value case.

If you agree with me so far, let's assume for a second we'd start our quest for ubiquitous identity protocols with the plan of first using name-value pairs (because they are simple), and when we do need to represent more complex stuff, we move to XML, RDF, or some other kind of mechanism that can represent more structured information. The trouble is that we will end up with a fundamentally different representation of information than name-value pairs, one which is not downward compatible, such as:

father: Joe
mother: Jane
...
marriages: <set>\
 <marriage>\
  <husband id="1234"/>\
  <wife id="4567"/>\
 </marriage>\
</set>
ancestry: <ancestor>\
 <father>\
  <person id="1234"><male/><first>Joe</first></person>\
  <father>\
   <person .../>\
  </father>\
  ...\
 </father>\
 <mother>\
  <person id="5678">Jane</person>\
 </mother>\
</ancestor>

Information representation doesn't get much uglier than this mixture of simple name-value pairs and XML, in my experience, and I feel for any poor soul who has to debug or support this in production deployments.
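A rough Python sketch of what a consumer of such a mixture would have to do, under stated assumptions: the input is a cut-down version of the example (each value on one line, no backslash continuations), and `parse_mixed` is a hypothetical helper, not anybody's actual implementation. The point is that the reader needs two parsers and has to guess, value by value, which one applies.

```python
import xml.etree.ElementTree as ET

# Cut-down, single-line version of the mixed example (an assumption for brevity).
MIXED = """\
father: Joe
mother: Jane
marriages: <set><marriage><husband id="1234"/><wife id="4567"/></marriage></set>
"""


def parse_mixed(text):
    """Split 'name: value' lines, then sniff each value to see whether it is XML."""
    result = {}
    for line in text.splitlines():
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if value.startswith("<"):
            # Second parser, with its own error model, chosen by guessing.
            result[name] = ET.fromstring(value)
        else:
            result[name] = value
    return result


data = parse_mixed(MIXED)
print(data["father"])                                        # a plain string
print(data["marriages"].find("./marriage/husband").get("id"))  # an XML lookup
```

Callers now get back either a string or an element tree depending on the name they ask for, which is exactly the kind of thing that is painful to debug.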

Ergo: let's please not do name-value pairs for new protocols, but let's assume identity information is going to be more structured than name-colon-value.

Coincidentally, there is good precedent for that already: RSS, OPML, Atom etc. It would have been quite easy to design a format with the same capabilities as RSS that was name-value-based (at least quite easy compared to some of the contortions one would have to go through for the examples above). But neither Netscape nor Dave Winer nor the Atom guys nor anybody else that I know of seriously considered that. This should give us pause: if their information, which is simpler in structure than much of what we need for identity, needs to be represented in XML (or, some people would argue, RDF), chances are that simple name-value approaches, although appealing at first sight, simply won't work as soon as people adopt them for not-quite-trivial use cases.
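For what it's worth, here is a small Python sketch of what a name-value rendering of a feed could look like. The feed content, the dotted-and-indexed naming convention and the `flatten` helper are all made up for illustration; this is not real RSS and not a format anybody has proposed.

```python
def flatten(value, prefix=""):
    """Yield (name, value) pairs for a nested dict/list structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


# A hypothetical, heavily simplified feed.
feed = {
    "channel": {
        "title": "Example feed",
        "item": [
            {"title": "First post", "link": "http://example.com/1"},
            {"title": "Second post", "link": "http://example.com/2"},
        ],
    }
}

pairs = list(flatten(feed))
for name, value in pairs:
    print(f"{name}: {value}")
```

It can be done for a feed, but note where the structure went: into a naming convention inside the names, which every producer and consumer has to agree on and reimplement.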

I will have a constructive suggestion for what to do some time soon (in the meantime, look at what we do with the LID Profile for Traversals and Selections), but for the time being, I'd like to focus only on my case that name-value pairs won't work, and I'm looking forward to your (dis)agreements.

Joaquin Miller saw a draft of this post yesterday, and commented:

You are definitely right. The point in brief is:

Data is not a collection of data elements; instead data has structure.

Yep. That is also true for identity data, and that's why we need something other than name-value pairs as the foundation to represent it.

Update: Chuck Mortimore, who demonstrated his open-source version of InfoCard-In-The-Browser at IIW, says he "completely agree[s]", as have several people who have sent me e-mail privately so far.

Update: John Panzer points out that some people have argued that one could do RSS without XML. Sure one can, just like one can do the contortions in my examples above ... Certainly, the contortions for RSS would be much simpler than the ones that we would have to go through for more complex identity.


At TieCon Today

As I have done for some years now, I'm again visiting TieCon, the ever-growing entrepreneurship conference put on annually by Tie. There are more than 3000 people here.

John Doerr is just pointing out how the face of Kleiner Perkins has changed over the past five years: from "white male" only to something that is much broader.
