Pair programming and microarrays

July 9th, 2008

Yesterday I met with folks at Lawrence Berkeley labs. The PI entered the room, full of energy and clearly ducking briefly out of the fray to speak with us. Part of the discussion revolved around microarray experiments. We’ve all heard how notoriously difficult microarray experiments are to reproduce. People have proposed minimum information standards (really they’re guidelines) to combat this problem, and we’ve also all heard that these standards often aren’t enough. Even when people follow the guidelines, inevitably some crucial piece of information doesn’t look critical at the time and therefore isn’t communicated.

The PI noted that he has found it helpful when more than one lab conducts the same experiment, so that each can help the other avoid finicky and/or tacit experimental conditions that would prevent others from reproducing their results. I have wondered for some time (and for the case of microarrays in particular) whether the practice of “pair programming” that we use in software development would do more than minimum information standards to increase the reproducibility of complex experiments. The problem with this, as the PI pointed out, is that duplicating every experiment can get expensive, and in the world of soft money (especially today’s world), people are always looking for ways to make the research dollar go farther. The possible long-term efficiency gained by duplicating some efforts, which increases data value and reduces the tendency to go down blind alleys, might not be easy to quantify, and thus not easy to weigh against the immediate penalty of “getting half as much work done”. (That’s certainly true in software.)

The PI pointed out that even if direct duplication were too expensive, he still advocated some kind of collaboration on experiments. In particular he advocated getting people together in the same room to look at the experiment as it was being performed, so that the collaborator might catch important things that weren’t immediately apparent to the person performing the experiment. At most, this costs a small amount of travel funds.

I asked the PI if others shared his views, and he said that most of the larger microarray efforts had some sort of distributed work going on, but he wasn’t sure that this idea had been formalized anywhere.

I’m interested in this not only because of its parallel with software work, but also because I work for a company focused on facilitating collaborative science. I’m very interested in the different forms that scientific collaboration can take, and how best to help them along.

CDD community meeting on open R&D for developing world disease

May 6th, 2007

Last August I moved out to San Francisco to join a great cheminformatics startup, Collaborative Drug Discovery, as director of software development. Two months ago (March 5th) we had our first user community meeting on open R&D for developing world disease drug discovery. It was an inspiring event, both because of the evident energy of the community and because it made it so much clearer to me how important our customers’ work is.

Prof. Jim McKerrow at UCSF gave a nice overview of the scope of the work our customers face, and how collaboration (through CDD and otherwise) helps them arrive at cures sooner and more efficiently (the slides are blurry, so download them separately). We put up several other talks from the meeting on Google Video, available along with PDF slides from our website, including one by the famous medicinal chemist, Chris Lipinski, who is a member of our customer advisory board. Cool stuff.

A better LSID, part 1

April 30th, 2007

Those of you familiar with bioinformatics might know about the Life Sciences Identifier (LSID) specification, which describes a URN-based identifier (think “primary key”) for life sciences data objects (genes, proteins, microarray experiments, radiology images, clinical trial study calendars, etc.). I learned a lot about these over my last year and a half at Northwestern during my work with caBIG, because I was heavily involved with developing the first draft proposal for the use of data object identifiers within the caBIG grid. The intended benefits of LSIDs primarily include the following:

  1. location independence — an LSID identifies a resource, not a particular data record on a particular server on the network
  2. global uniqueness — the same LSID should never be used to refer to two distinct objects
  3. local assignment — there must be some easy mechanism for data object creators to assign globally unique LSIDs to their data without the need for onerous bureaucracy
  4. permanence — once assigned, an LSID cannot be reassigned to a different object
  5. semantic opacity — data clients who use LSIDs to refer to data are not supposed to read into the substructure of the LSID and make any conclusions about what the LSID means, other than being a key for a particular data object, e.g. urn:lsid:frank:dog:38922 should not be assumed to have anything to do with Frank or his dog.
  6. data and metadata — in addition to the bytes representing the data object identified, there is a separate “channel” of information, the metadata, which can contain information about relationships between the given data object and other data objects.

There are some other benefits as well; see the official spec. LSIDs look like this: urn:lsid:ncbi.nlm.nih.gov:pubmed:9486653. One of the key steps in employing LSIDs on a grid (or in any network-accessible way) is the deployment of some kind of resolution service, a server or set of servers that delivers the data for a given LSID to a requesting data client. Also, as a corollary of the “global uniqueness” requirement above, it is important that requesting resolution of a given LSID should always return the same set of bytes, regardless of when the resolution occurs. That is, if you assign an LSID to an object, you assign it to a specific set of bytes representing that object, and that association can never change (although the data provider might stop providing the data). This allows you to cache these objects without worrying about cache invalidation (since the bytes will never change) and, more importantly, enables reproducible research: a computational researcher can publish results based on a dataset identified by a bunch of LSIDs, and another researcher can later run the same computations on those same LSIDs and expect to arrive at exactly the same results.
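To make the caching point concrete, here is a minimal Python sketch. It is not taken from the spec or from any real LSID library: the names `Lsid`, `parse_lsid`, and `CachingResolver` are hypothetical, and the `fetch` callable stands in for whatever resolution service is actually deployed. The parser assumes the colon-separated layout seen in the example above (authority, namespace, object, plus an optional trailing revision). Splitting the identifier is something a resolver does to validate and route a request; data clients should still treat the LSID as opaque.

```python
from typing import Callable, Dict, NamedTuple, Optional


class Lsid(NamedTuple):
    """Parts of an LSID (hypothetical helper type, for illustration only)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(part != "" for part in parts[2:])
    )
    if not well_formed:
        raise ValueError(f"not an LSID: {text!r}")
    return Lsid(*parts[2:])


class CachingResolver:
    """Wraps a fetch function with a cache that is never invalidated.

    Skipping invalidation is only safe because an LSID is bound to one
    fixed set of bytes. Assumption: the raw LSID string is used as the
    cache key, with no case normalization.
    """

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, bytes] = {}

    def resolve(self, lsid: str) -> bytes:
        parse_lsid(lsid)  # reject malformed identifiers before fetching
        if lsid not in self._cache:
            self._cache[lsid] = self._fetch(lsid)
        return self._cache[lsid]
```

A client would build the resolver once, for example `CachingResolver(my_fetch)` where `my_fetch` is a placeholder for a call to the real resolution service, and then call `resolve` as often as it likes; only the first request for a given LSID reaches the network.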

Importantly, the metadata referred to by an LSID is not required to be byte-identical, so it has a bit more flexibility to cover some important use cases, as we will see in my next post on the subject.

Agile bioinformatics paper

June 18th, 2006

I don’t know why I didn’t post this before, but at the end of last month BMC Bioinformatics posted a provisional version of our paper on agile software methods in bioinformatics. The good news is people seem to be reading it. It is the journal’s #3 most viewed paper in the last thirty days! Give it a read and let me know what you think.

SciLnk

April 20th, 2006

Some friends of mine have been working on an interesting new startup called SciLnk. Essentially, the site is LinkedIn for life scientists. It will also allow you to browse PubMed abstracts via the network of authors. Often when you’re researching a subject you want to read all the papers published by a particular person and his or her collaborators on the topic, and it’s not as easy as it should be to collect everything via PubMed. SciLnk hopes to leverage the people network to improve the literature browsing experience and vice versa. There are lots of other interesting directions to take such a resource: improving the conference-going experience, grant searching/suggestions, job searching, etc.

They recently posted screenshots (and, more recently, these) of the product under development, and you can sign up to get in on internal beta testing at scilnk.com.

Full disclosure: I was personally working with the SciLnk team in its early days, but quickly realized I had bitten off more than I could chew, what with having a full-time day job and not living in Boston with the rest of the crew. I no longer have any financial interest in the company; I just think it’s a cool idea.