May 25, 2006
Here are some of the proposals I have heard recently:
- name-value pairs
- name-value pairs in a hierarchical arrangement (think LDAP)
- the relational model (think SQL)
- XML
- RDF and/or RDF schema
- OWL
- UML and variants
and I'm sure I'm forgetting a few.
If this sounds like every possible model in the known universe of data representation,
that's probably because it is...
When faced with a situation like this (and this is entirely independent of the topic of digital
identity, but just good information modeling practice), what I tend to do is to determine whether
some of the proposed approaches ("meta-models" would be the more correct term)
are just inherently more powerful than others. Then I
use the most powerful one to express the information that needs expressing, and
after that, I see whether a simpler approach could have been used instead
of the more complex one. Let's see where that leads us.
Name-value pairs are clearly the simplest of the lot. Hierarchical name-value pairs
are a superset, as is XML. Both of those (without additional conventions) have an
inherently hierarchical structure, and it is hard to model general-purpose graphs
in them, which RDF/RDFS, OWL, UML and others can do, so the latter are more powerful.
RDF schema, OWL and UML in turn are more powerful than RDF by itself because they can
not only express graphs, as RDF can, but also have
additional concepts like inheritance, entities vs. attributes etc. So those three
are similarly powerful (and the distinctions, for our purposes here, are largely
irrelevant).
So let's take an example set of identity data and model it accordingly. Let's say
I want to model Joaquin's and my business card information. We need to capture name,
company, job title, and business and private phone numbers. With a UML-ish mindset,
we'd find the following concepts:
- Person, with attributes such as Given and Last Name
- Company (which is probably a special case of a more general concept called Organization)
with attributes such as LegalName etc.
- Role, with attributes such as Name
- VoiceCommunicationsEndPoint (or whatever we want to call this), with an attribute called
PhoneNumber
- Some others which we'll ignore for this example in the interest of brevity
- associations between them.
For the above example, we are getting two instances of Person (Joaquin and Johannes). We
get one instance of Company (NetMesh). There are two instances of Role (CEO, Architect),
and a number of instances of VoiceCommunicationsEndPoint (for the various phone numbers we
are using). There are relationships between Person Joaquin and Role Architect, between
Role Architect and Company NetMesh, between Person Johannes and Role CEO, between CEO and
Company NetMesh.
The associations between the VoiceCommunicationsEndPoint and the other objects are
interesting. My private phone number is clearly an association between Person Johannes
and the VoiceCommunicationsEndPoint, but my business phone number is most likely best
represented as an association between VoiceCommunicationsEndPoint and Role. This is because
my use of this VoiceCommunicationsEndPoint is contingent upon my being in the role at
NetMesh that I'm in; if I were fired, my use of the VoiceCommunicationsEndPoint would
go away, too.
This is a very tiny example, but it serves as a good illustration for how complex object
relationships can become very quickly in an identity scenario, and how rich a vocabulary
one would like to use.
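To make the shape of this model a bit more tangible, here is a minimal sketch in Python. It is only an illustration under assumptions: the class names follow the concepts listed above, but the field names, the `reachable_numbers` helper and the phone numbers are made up for this example and are not taken from any real system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VoiceCommunicationsEndPoint:
    phone_number: str  # placeholder numbers only, see below


@dataclass
class Company:
    legal_name: str


@dataclass
class Role:
    name: str
    company: Company  # association Role -> Company
    # Business phone numbers hang off the Role, not the Person.
    endpoints: List[VoiceCommunicationsEndPoint] = field(default_factory=list)


@dataclass
class Person:
    given_name: str
    roles: List[Role] = field(default_factory=list)
    # Private phone numbers are associated with the Person directly.
    private_endpoints: List[VoiceCommunicationsEndPoint] = field(default_factory=list)

    def reachable_numbers(self) -> List[str]:
        """Private numbers first, then the numbers that come with each role."""
        numbers = [e.phone_number for e in self.private_endpoints]
        for role in self.roles:
            numbers.extend(e.phone_number for e in role.endpoints)
        return numbers


# One Company instance, shared by both roles (hypothetical data).
netmesh = Company("NetMesh")
ceo = Role("CEO", netmesh, [VoiceCommunicationsEndPoint("+1-555-0100")])
architect = Role("Architect", netmesh)
johannes = Person("Johannes", [ceo], [VoiceCommunicationsEndPoint("+1-555-0101")])
joaquin = Person("Joaquin", [architect])

before = johannes.reachable_numbers()
johannes.roles.remove(ceo)  # losing the role ...
after = johannes.reachable_numbers()  # ... also loses the business number
print(before, after)
```

Because the business number is attached to the Role, removing the Role from the Person takes the number with it, while the private number stays. That is the behavior argued for above, and it falls out of the structure rather than needing special-case code.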
So let's see how we would represent the same information using less powerful techniques
than UML / OWL etc.:
- It's clear that using RDF wouldn't be much of a problem. One can clearly express this information as
a graph, constructed from a set of triples. It loses the distinction between attributes
and classes, but with a well-managed schema, that is not really much of a problem
(tool support, verbosity etc. are other issues, but we'll not deal with those
right now).
- Plain XML is a bit harder, for two reasons: 1) In our model, we have one instance of
company, which is related to two people. In other words, we have something that can only
be expressed in the hierarchical XML structure if we put the company at the top (as
LDAP does). Unfortunately, this structure breaks down as soon as one person has a
relationship to more than one company: we'd have to construct a workaround
to have two places in two XML files be declared to be synonyms (either representing
the person, or the company, depending on whether we put the company or the person
at the root of the hierarchy). This is one of the reasons directories are usually
only applied to employees of a single organization, by the way.
Fortunately, there is a trick that can get us out of this particular problem: if we
use URLs/URIs (which could be identity-enabled, by the way) in place of in-lined
company or person information, it is easy to determine
that two nodes in two different XML files represent the same person, or company,
and it is easy to look up information about it simply by dereferencing that URL.
So we can probably make plain XML work.
- But name-value pairs become really, really ugly, just like in my previous example. It's easy to come up with name-value
pairs for a Person, or a Company, or a Role, or any of the classes. But it is very
hard, if not impossible, to represent the associations. How do you say, for example,
using name-value pairs, that Johannes works in the role of CEO for NetMesh, and
as part of this, may use this phone number? At the very least, we'd have to have
the ability for properties to carry two values, not just one (so we could represent
an association between, say, Johannes the Person and CEO the Role).
If we don't have these two pointers, and absent, say, reconstructing RDF in
name-value syntax, we can't represent the information in the richness that is
inherent even in this very simple example. Instead, we'd have to fudge, such as:
we don't allow the person to work for multiple companies, or we just call it
"work phone" and don't relate it to the role, or we leave it as an
exercise to the reader to determine that if Joaquin and Johannes both work for
a company whose name is "NetMesh Inc.", it must be the same company. None of
which is a good foundation to build trustworthy stuff on ... which is why I'm
calling it a dead end.
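The contrast between the graph-style and the name-value-style representations can be sketched in a few lines of Python. This is a rough illustration only: the URIs, the predicate names (`hasRole`, `atCompany`, `mayUse`, `legalName`), the helper functions and the phone number are all hypothetical and do not come from RDF, LID or any other vocabulary.

```python
# Hypothetical identifiers; using URIs means "same company" is an identity
# comparison on the identifier, not a guess based on a display name.
NETMESH = "http://example.com/org/netmesh"
JOHANNES = "http://example.com/person/johannes"
JOAQUIN = "http://example.com/person/joaquin"
CEO = "http://example.com/role/netmesh-ceo"
ARCHITECT = "http://example.com/role/netmesh-architect"
WORK_PHONE = "tel:+1-555-0100"  # placeholder number

# Graph as a list of (subject, predicate, object) triples.
triples = [
    (JOHANNES, "hasRole", CEO),
    (JOAQUIN, "hasRole", ARCHITECT),
    (CEO, "atCompany", NETMESH),
    (ARCHITECT, "atCompany", NETMESH),
    (CEO, "mayUse", WORK_PHONE),
    (NETMESH, "legalName", "NetMesh Inc."),
]


def objects(subject, predicate):
    """All objects o for which (subject, predicate, o) is in the graph."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def companies_of(person):
    """Follow Person -> Role -> Company."""
    found = set()
    for role in objects(person, "hasRole"):
        found.update(objects(role, "atCompany"))
    return sorted(found)


# The same data flattened into name-value pairs: the associations are gone.
flat_johannes = {"name": "Johannes", "company": "NetMesh Inc.",
                 "title": "CEO", "work phone": "+1-555-0100"}
flat_joaquin = {"name": "Joaquin", "company": "NetMesh Inc.",
                "title": "Architect"}

# With triples, both people provably point at the same company node.
same_company = companies_of(JOHANNES) == companies_of(JOAQUIN)
# With flat pairs, all we can do is compare strings and hope.
same_company_guess = flat_johannes["company"] == flat_joaquin["company"]
print(same_company, same_company_guess)
```

Both comparisons happen to come out true here, but only the first one rests on identity. The flat version also has no way to say that the work phone belongs to the CEO role rather than to the person, which is exactly the kind of fudging described above.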
So my conclusion, which I'm sure you aren't surprised about given what we've done
in LID and InfoGrid, is that we can probably scrape by with XML, but not with
anything simpler. InfoGrid itself (the platform, of which identity is only one
component), is based on a richer approach that's closer to UML and OWL, and
with examples like this one,
it's easy to understand why. For InfoGrid LID, the identity layer, we chose
XML, and very consciously so, although this is the first time that the reasoning
has been written down ...
May 25, 2006
Is it non-overlapping? The same, or a subset? Is there overlap; if so, where, and under which
circumstances?
These questions are at the heart of the thought process that needs to get into
designing identity technologies for the era of pervasive identity. For example,
if the answer was "non-overlapping", then we could merrily go ahead
and design identity systems on the green field (at least with respect to
non-identity systems), not worrying about what is in, say, the customer database.
If the answer was "the same, or a subset",
then we'd better not design any identity systems, but instead devise methods by which
existing systems (such as transactional systems and other systems that were not
built for identity per se) can be recruited to also meet the requirements of identity.
I've found that when I'm asking that question, which I occasionally do, the answers
that I'm getting from people in the community often dramatically differ depending on
how much the person being asked has an "enterprise directory" background.
(And if you think about that for a minute, that makes some sense.) To exaggerate,
and to be not entirely fair for a minute, the opinion of people with that background sometimes
seems to be "what's in the directory is the identity information, and
all other systems aren't directories, so they don't have identity information",
with the conclusion of "little or no overlap". If that sounds like a
self-fulfilling prophecy to you, it certainly does to me.
In my experience, however, this traditional view is increasingly inconsistent with
what is happening, in terms of user behavior, in terms of the
shift
of power and control from centralized organizations with a firm wall around them to individuals,
and in terms of the new technologies and services that are springing up in response.
For example, the other day, I needed to get in touch with somebody whose phone
number I had lost, or had never had in the first place. I found it by Googling his
name first, finding his blog, and doing a whois lookup on the domain name of his blog.
It could have been through reverse phone number lookup by address at Google.
Or by going through LinkedIn. If we had happened to work for the same company
with a well-maintained directory, I could have gotten the number from there;
but we didn't.
Of course, this use case is not an enterprise use case. But that is the whole
point about pervasive, independent identity! It isn't tied into any one organization
or central repository of identity information. It is the non-enterprise use cases,
the "open internet" use cases of identity technology, that need to be
addressed today because increasingly, the people we interact
with and relate to are outside of the confines
of the same organization; certainly outside of 9-5, which isn't what it used to
be, either. There is also the convenience factor: Google is a lot closer to the
fingertips of a lot of people than the enterprise directory application.
Note also that what is one person's identity data is somebody else's transaction data.
We certainly don't run MyLID.net
(our hosted LID, OpenID, Yadis identity service) on top of a directory, and
I'm pretty sure that is also true for many social web applications. Netflix's
social network functionality has a lot of identity-related data in it, but
they probably (conjecture on my part, I don't know) store it just in the same
database that all their other information is in: maintaining the relationships
between people and their purchases would be rather difficult if one introduced
the usual impedance mismatch between a directory and a database; the benefits
would be rather marginal, and Netflix does use transaction data as a form of
identity data in any case ...
The conclusion: a separation between these different kinds of data, and
allocation to different kinds of information systems with strict boundaries between
them, might have made sense in the past, and within a tightly structured IT
environment (and even then, show me the enterprise application that does not
have at least a bit of identity information in it). Today, on the open
web, with social software being one of the primary areas of innovation,
this separation is increasingly anachronistic, if it is performed for the purposes
of "separating identity data from transaction data". (There are good
other reasons, such as differing performance profiles. But conceptually, we
should be thinking about one tightly cross-referenced set of information, even
if we decide that data item A should rather sit in system B than C because it's
faster, or cheaper, or ...).
What we need, in the end,
is an approach that considers the entire web and enterprise IT infrastructure,
warts and all, as one giant, distributed, decentralized meta-directory (or
meta-database, or ...) that has parts that are optimized for different
requirements, but that can be accessed uniformly so application development
"native to the web"
is possible. Identity data elements are a subset of all of that information, and tightly
related to other data elements, both identity and not. And that way,
we don't even need to draw an artificial line around whether or not information item
X (say, somebody's presence or transaction record on eBay) is or isn't a piece
of identity information.
May 25, 2006
I proposed the following topic:
Let's role play as aspiring cyber criminals, 10 years from now, when there's
a ubiquitous digital identity layer that is part of all digital interactions
over the internet. How do we make a buck or a billion?
for dinner conversation at the upcoming Harvard/Berkman conference on digital
identity. Apparently they have a Berkman conference tradition called Food
for Thought dinners. These dinners are informal gatherings of about 8 people
who gather around discussion questions and are led by one of the conference
panelists.
I wonder whether we'll find a killer business plan in there ;-) and if so, what
we can do about (ahem, against) it.
May 25, 2006
Last week, I posted
about why forcing identity data into name-value pairs is an
architectural dead end. Of the many
comments that I received,
those from Phil Hunt and Mark Wilcox, in particular, turned out to warrant a much more detailed
response than I initially thought. I realized that they are
raising a much broader topic that one could call "The Challenges of Open Data",
applied to the example application domain of digital identity. I hopefully will
get around to writing an article on the general case some time soon, but for now,
I'll focus on digital identity data. Because that is already complex enough, I have
broken down my thoughts into multiple posts, which will be published over a few days,
one at a time, and which, for convenience, will be linked from here.
I'm rephrasing the points that were made as questions (and hope I don't miss
anything really important) and also add a few related questions.
- "How is identity data different from other kinds of data, such as transactional data?
Where does one start and the other end? What overlaps are there?"
(go to separate
post)
- "What is a good way of thinking about the (conceptual) structure of identity information?
Is it RDF? Is it name-value pairs? Is it SQL? Is it ... [long list of potential
candidates]."
(go to separate
post)
- "Given that LDAP seems to work for identity data in many use cases, where does
the need for more complex structures for identity data arise, and what are those
more complex structures?"
(go to separate
post)
- "Applications written against LDAP directories from one vendor often do not
work against LDAP directories of other vendors. If we want to build a ubiquitous
identity layer on the internet, how are we going to solve this
problem?". This question is really about how to deal with multiple ontologies
of identity data.
(go to separate
post)
- "What is the best way of representing identity information for the purpose of
storage and retrieval?"
(go to separate
post)
- "What is the best way of representing identity information during exchange
on the open internet?"
(go to separate
post)
- "Can name-value-pair-based representation of identity information be "fixed"
with few additional conventions that add a lot of power? E.g. by extending
the allowed values (in the name-value pairs) to be "pointers" to other name-value
pairs?"
(go to separate
post)
May 25, 2006
Yesterday, we suddenly started getting about one hit per second on the same
web page from a host known as:
- http://blo.gs/ping.php
- 66.218.65.40
- blastws1.my.scd.yahoo.net
Why would it do such a thing? The page it hits has not even changed in over
three weeks! If anybody has any insights, I'd be very grateful if you could share them.
Sending a message through http://blo.gs/contact.php
so far has not produced a result. This host is run by Yahoo, and one would hope that
they both have their polling policies and their respond-to-questions policies
under control.
For now, our firewall is instructed to kill all incoming traffic from that
host.