A better LSID, part 1
Those of you familiar with bioinformatics might know about the Life Sciences Identifier (LSID) specification, which describes a URN-based identifier (think “primary key”) for life sciences data objects (genes, proteins, microarray experiments, radiology images, clinical trial study calendars, etc.). I learned a lot about these over my last year and a half at Northwestern during my work with caBIG, because I was heavily involved with developing the first draft proposal for the use of data object identifiers within the caBIG grid. The intended benefits of LSIDs primarily include the following:
- location independence — an LSID identifies a resource, not a particular data record on a particular server on the network
- global uniqueness — the same LSID should never be used to refer to two distinct objects
- local assignment — there must be some easy mechanism for data object creators to assign globally unique LSIDs to their data without the need for onerous bureaucracy
- permanence — once assigned, an LSID cannot be reassigned to a different object
- semantic opacity — data clients who use LSIDs to refer to data are not supposed to read into the substructure of the LSID or draw any conclusions about what it means, other than that it is a key for a particular data object. For example, urn:lsid:frank:dog:38922 should not be assumed to have anything to do with Frank or his dog.
- data and metadata — in addition to the bytes representing the data object identified, there is a separate “channel” of information, the metadata, which can contain information about relationships between the given data object and other data objects.
There are a few other goals as well; see the official spec for those. LSIDs look like this: urn:lsid:ncbi.nlm.nih.gov:pubmed:9486653.
One of the key steps in employing LSIDs on a grid (or in any network-accessible way) is deploying some kind of resolution service: a server or set of servers that delivers the data for a given LSID to a requesting data client. As a corollary of the “global uniqueness” requirement above, resolving a given LSID should always return the same set of bytes, regardless of when the resolution occurs. That is, when you assign an LSID to an object, you assign it to a specific set of bytes representing that object, and that association can never change (although the data provider might stop providing the data).
This allows you to cache these objects without worrying about cache invalidation (since the bytes will never change) and, more importantly, enables reproducible research: a computational researcher can publish results based on a dataset identified by a bunch of LSIDs, and another researcher can later run the same computations on those same LSIDs and expect to arrive at exactly the same results.
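To make the caching argument concrete, here is a minimal Python sketch of a client-side cache that never invalidates its entries. Everything named here (`LSID`, `parse_lsid`, `ImmutableCache`, and the `fetch` callback) is hypothetical and only for illustration; none of it comes from the LSID spec or any caBIG library. The sketch assumes the `urn:lsid:authority:namespace:object` layout seen in the example above, plus an optional trailing revision field as I understand the spec to allow. It splits the identifier only so that a resolver could route the request; ordinary data clients should still treat the string as opaque.

```python
from typing import Callable, Dict, NamedTuple, Optional


class LSID(NamedTuple):
    """Fields of an LSID, split out for routing purposes only (hypothetical type)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its fields."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:])  # no empty fields
    )
    if not well_formed:
        raise ValueError(f"not an LSID: {text!r}")
    return LSID(*parts[2:])


class ImmutableCache:
    """Caches resolved bytes forever, relying on the rule that an LSID's bytes never change."""

    def __init__(self, fetch: Callable[[LSID], bytes]) -> None:
        # `fetch` stands in for a real resolution service; it is an assumption of this sketch.
        self._fetch = fetch
        self._store: Dict[str, bytes] = {}

    def get_data(self, text: str) -> bytes:
        # No expiry and no revalidation: a hit is always still correct.
        if text not in self._store:
            self._store[text] = self._fetch(parse_lsid(text))
        return self._store[text]
```

A client would wrap whatever resolution call it has in `ImmutableCache` and call `get_data` with the LSID string. The second request for the same LSID never touches the network, and there is no staleness check to get wrong. The one failure mode left is the provider ceasing to serve the data, which a local cache actually helps with.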
Importantly, the metadata associated with an LSID is not required to be byte-identical from one resolution to the next, which gives it a bit more flexibility to cover some important use cases, as we will see in my next post on the subject.