Refactoring bio-ontologies

October 7th, 2005

Recently I came across a reference to this article from Nature Biotechnology about the insufficiencies of the current implementations of biological ontologies. The article points out that most if not all current ontologies used or being developed for computational biology have serious design flaws, flaws that hamper the use of these ontologies in computational work.

I have heard this before, and have seen some evidence of the problem myself. There seem to be two camps. In one camp you have the people who look at ontologies mostly in light of immediate practicality: they want to start using these things today, and don’t care if they’re perfectly designed, or even designed well (I’m exaggerating I’m sure). The other, more purist camp finds this lack of concern for good design vexing, and worries about the long-term usability of the junk being pumped out by the first camp.

I’ll go ahead and take the highly noncontroversial position that both of these extremes are, well, a little extreme, and that we need to look for a compromise somewhere in the middle. The discussion reminds me to some degree of the somewhat older discussion concerning reusable design in object-oriented software. In that discussion, too, you had roughly two camps, one skeptical of the value of spending too much time on design, and the other decrying the horrible, non-reusable software hacked out by the former. This discussion has been ongoing for quite a while now, and there are a number of lessons that I think we can apply to the ontology design debate.

First, designing something right the first time never happens, no matter how much effort is spent. I think it’s reasonable to be skeptical of groups claiming to be working on designing the ontology that will capture everything and will solve everyone’s problems. Even if these groups actually complete what they set out to do, no doubt there will be issues the group will not have considered sufficiently.

This does not mean that there is no value in design, however. Point two: bad design does indeed cause headaches and waste lots of time, principally for those who come later and try to use or to evolve a bad design for new purposes.

Point three: Bad design happens. Most object-oriented code written today is not designed well. This is a comment on the level of training most software developers get and on how well their organizations help them improve, as well as a recognition of the inevitability of human error. I think we can expect that the same mere mortals (I include myself) will be building our biological ontologies.

Four, “good design” is not an absolute. There is ambiguity about what constitutes good object-oriented design. That is, while there are some mathematical principles to help us along, it is certainly possible to follow those principles exactly and still produce a horrible design. It is also possible to produce a beautiful design and break a rule or two. Style plays a part to some degree, not just math. Although mathematical principles may play a greater part governing the design of an ontology, I suspect that style and approach still have a role. Thus, two ontologies may not mesh well together because of differences of style or approach.

I’d say we need two things to help us design ontologies better: design standards and refactoring. Design standards (for software they’re often called coding standards) itemize the recognized areas of stylistic variation, and choose, arbitrarily, one way of handling each variation. The goal is to make code written by one person look almost identical to code written by another person, so that every person on the team can work on the code without having to modify these trivial variations to their liking each time. Readability and workability are improved, even if not everyone on the team agrees absolutely with each choice. Agreeing on common design standards for ontologies may be more difficult, however, because of the larger community that must agree.

Refactoring is the practice of altering software code in a way that does not change its function, but (hopefully) improves its design. Refactoring should make code easier to evolve and to use. Even though you know you won’t get it right the first time, you commit to refine the design each time you get a chance. Some call this practice “continuous design”. Rather than discounting the importance of design, it places design in a role rightfully central to the everyday work of each software developer.
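To make "does not change its function" concrete, here is a minimal sketch of one small refactoring (extracting a function) in Python. The function names and the FASTA-style input are my own illustration, not taken from any particular project; the point is only that the before and after versions return the same result for the same input.

```python
# Before: one function mixes line filtering and arithmetic.
def mean_length_before(lines):
    total = 0
    count = 0
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
            count += 1
    return total / count if count else 0.0


# After: same behaviour, but the filtering step is extracted and named,
# so it can be read, tested and reused on its own.
def sequence_lines(lines):
    """Yield stripped, non-empty lines that are not FASTA-style headers."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line


def mean_length_after(lines):
    lengths = [len(line) for line in sequence_lines(lines)]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

In practice a test that compares the two versions on the same input is what lets you make this kind of change with confidence.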

It helps developers refactor better if they have guidelines indicating when they might want to refactor, when design could be improved and how. To this end, a number of people (notably, Martin Fowler) cataloged a list of refactorings and their associated “code smells” (a good example of the typically casual vernacular of the software development community; a casualness that I think sometimes, unfortunately, puts off some scientists who take themselves too seriously). A “code smell” is a description of a (figuratively malodorous) symptom of poor design, which one might notice when looking at a particular piece of code. The refactoring catalog then describes the refactorings, or step-by-step procedures, a developer might undertake to remove the smell.

There is much more to the practice of software refactoring, but I think the general idea applies quite well to ontologies. It would be nice if, instead of having to catalog a list of specific problems with a particular ontology (as the authors of the above-mentioned article did), people could just say “oh yes, ontology X has symptom A and symptom B. You might want to consider refactorings 1, 5 or 6 to resolve those problems.” Not only would we then have a language to describe how to improve the designs of these things, I think it would also be much easier to teach people how to create better designs. Finally, successful ontologies could be refactored–improving their design, making them more interoperable with others and more amenable to computational work–without having to get things right the first time.
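As a rough sketch of what such a catalog might look like in code, the Python below maps symptoms to candidate refactorings. Every symptom and refactoring name here is hypothetical, invented purely for illustration; no such catalog exists for ontologies as far as I know.

```python
# Hypothetical catalog: symptom name -> candidate refactorings.
# None of these names come from a published source.
SMELLS_TO_REFACTORINGS = {
    "term_with_multiple_unrelated_parents": ["split_term", "introduce_relation_type"],
    "is_a_used_for_part_of": ["introduce_relation_type"],
    "circular_definition": ["rewrite_definition"],
}


def suggest_refactorings(symptoms):
    """Return candidate refactorings for the observed symptoms.

    Suggestions are listed in first-seen order, without duplicates.
    Symptoms that are not in the catalog are silently skipped.
    """
    suggestions = []
    for symptom in symptoms:
        for refactoring in SMELLS_TO_REFACTORINGS.get(symptom, []):
            if refactoring not in suggestions:
                suggestions.append(refactoring)
    return suggestions
```

Calling suggest_refactorings with the symptoms observed in "ontology X" gives the short list of procedures worth considering, which is the kind of shared vocabulary I have in mind.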

The truth is that many ontologies will be created, and not all of them will survive. By developing a set of refactorings and design symptoms for ontologies, we can help strengthen the valuable ones and understand better when to discard those past the point of saving.

Another comment about all of this design purism: I’ve also heard the opinion that this desire to have one mother-of-all-things ontology is a little ridiculous, because there are honest differences of opinion among scientists about how to categorize some biological concepts, and some concepts are simply ambiguous (please correct me if I’m off base here). These folks suggest that we find a way to allow multiple, incompatible ontologies to coexist, providing correspondences where necessary and possible via RDF. I’d love to hear more about this.

Hiatus breaker

September 13th, 2005

Alright, it’s been a while since I’ve posted. I’ve been up to my ears in a number of things. One, a group of six of us biomedical informatics people from across the nation (David Kane, Mike McCormick, Ethan Cerami, Karl Kuhlmann, Jeff Byrd and I) finished the first draft of a paper on agile methods in biomedical software development. We’re currently shopping it around to different journals, which thus far has been about what I expected, that is, no one has any idea what to do with us. As I’ve noted before, discussion of these matters is nearly absent in our community, at least in public. Hopefully our paper finds a home, both so that the effort was worth it and so that, maybe, it helps start a conversation.

Speaking of conversation, I’ll be attending the BRIITE conference in San Diego this November, at which I will try and propose a breakout session on software development practices, even though it will be a bit off topic.

I have also been working with caBIG, which has been a great opportunity to get out and meet other biomedical informatics people, learn about grid technology, etc. Shortly I’ll be co-leading the caBIG Architecture Workspace Best Practices SIG, which should be fun.

In other paper news, I have been writing a paper with Sean Martin and Ted Liefeld on the impact of LSID on biomedical informatics data. Everything I know about LSIDs I learned from Sean, Bob Robbins, Ted and others during many conversations as part of the caBIG Identifiers SIG. I never knew identifiers could be such a thorny problem. I guess the true source of the thorniness is the distributed nature of the data being identified. That would make a good blog entry some day, actually.

During the latter two quarters of the 2004-2005 academic year at Northwestern, I took/sat in on several business classes at the Kellogg school. I had always harbored the prejudice that business school classes had no content, and were all about networking. However, I was happily surprised to find out that I was just being small-minded. Although I don’t think I will ever pay the big bucks/spend the big chunk of time for the MBA, being exposed to the issues discussed in those classes was an important learning experience for me.

Finally, I went to the Agile 2005 conference this year and presented an experience report on tracking progress and estimation in software projects. We’re still working on porting the tool we’ve developed for issue tracking from ColdFusion (blech) to Ruby/Rails, but when it’s ready you can check it out at http://rhythm.sourceforge.net. This year I went with my two immediate coworkers, John and Rhett, and it was a great experience to go with them, to hear about their experiences there, and reinforce what we’ve been trying to do for the last few years. I also got to catch up with a couple people from ThoughtWorks days.

Bioinformatics Project Management

November 14th, 2004

So far bioinformaticians have, publicly at least, focused mainly on science. Any given conference or journal is full of papers about algorithms and newly available software tools. Conspicuously absent, at least to me, is any discussion of the software development process used to turn those algorithms into those working software tools. One might argue that such discussions already occur in other software communities, so reproducing them within the bioinformatics community has little utility. Also, algorithm innovation clearly deserves plenty of attention, because it has made possible this boom era of high throughput biology.

However, I think more is going on here. Other scientific software communities show the same lack of interest in discussing software methods. From personal experience I know how helpful even simple self-reflection can be for a software project, bioinformatics or otherwise, to say nothing of applying well-known, common-sense software development practices. The day-to-day work of building software involves many of the same elements, regardless of the application domain, and how you approach the work has a tremendous effect on the quality of product and on the team’s overall sense of satisfaction. Surely then, any team that has taken on the development of a sufficiently complex bioinformatics tool must appreciate the importance of software methods and processes. And yet, no one seems to be interested in talking about them.

I’m currently working on writing an editorial that explores my thoughts in greater detail, but semi-briefly, my guesses are these. First, there is a disincentive, especially in the academic community, for bioinformatics scientists to learn more about project management and software methods. Scientists mainly earn their reputations by presenting novel and important results at conferences and in journals, and time spent learning other skills detracts from this prime directive.

Second, many scientists simply don’t find project management very interesting. Their interest is chiefly in making new things possible, at least in theory, through innovation. Once they’ve shown that something can be done, they move on to find the next thing. How to actually manage a team to turn these innovations into production quality applications is not an “interesting question”, to use the cliché. Innovative research is certainly deeply compelling, and the kind of thing most scientists signed up for when they went to graduate school. Actually managing the day-to-day activities of the software development lifecycle can seem uninteresting, even trivial, to an outsider.

Third, some scientists believe that project management really is trivial. The traditional approach that many researchers resort to when they need a piece of scientific software is to get a graduate student to write the program. If it’s a more complex piece of software, get two graduate students to do it. If it’s a really really big project, then maybe they add a postdoc. This approach works well for some projects, but fails miserably for others. Usually, failure is blamed on the people involved (where, sometimes, some part of the blame fairly rests); the approach itself, however, does not usually receive much examination. I would argue that a wildly unsuitable approach is a fairly good guarantee of failure. This oversight on the part of PIs is partly a result of ignorance of software development issues and partly due to an assumption that mastery of their own discipline extends to mastery of others, when, in fact, it does not (something we are all guilty of at one time or another).

The thing is, thus far, bioinformatics has been driven by innovation, because people have concentrated on developing the algorithms that make high throughput biology possible. However, I believe we are moving into an age in which it is just as important to integrate existing algorithms into production quality applications that can serve larger groups of biologists for years on end. Getting there will require project management know-how that thus far has been largely ignored.

Finally, while it is true that the software development issues that face bioinformatics have much in common with the issues that face other types of software development, we cannot leave all discussion of bioinformatics software development issues to traditional software forums. There are things about bioinformatics software development that are unique, and we ought to provide space within the communal discourse to think them through.

But I’m not just going to talk about the problem. I’m looking for others interested to join me in the establishment of a conference on biomedical software development, either standalone or as part of another meeting. There would be tutorials on good software practices, papers presented giving project case studies, workshops on scientific software patterns and anti-patterns, keynotes from people from bioinformatics and the traditional software industry, etc. I think I could possibly get some good people from the traditional software industry interested, but I’ll need an interested group of bioinformatics folks to make this work. Please contact me at mmhohman@northwestern.edu if you’re interested in getting involved.

Directory Services: DNS and LDAP

November 14th, 2004

While attending the caBIG Architecture Workspace face-to-face in Chicago last month, I realized during a discussion that I probably didn’t know enough about directory services. One of the major challenges faced in designing the caBIG architecture is the matter of serving up Common Data Elements, or CDEs. These are standardized vocabulary terms used to compose a structured informational representation of a clinical outcome, expression level measurements, etc. Because they will be used throughout caBIG, some service must be provided to allow components of caBIG to look up these vocabulary terms, e.g. to fill in the details of a particular dataset. This vocabulary service must be distributed and highly available for even a moderately mature grid to function properly, and to keep CDE lookups from becoming a grid bottleneck.

At the face-to-face, Frank Hartel asked the room what sorts of servers and server architectures might be needed to fill such a requirement. I chimed in, naively, that we might learn something from the DNS, because of its ubiquity and high availability. Frank responded that he/they were thinking more of something like LDAP. Now, I’d heard of LDAP and knew it had something to do with looking up organizational directory information (names, email addresses, phone numbers), but had no real hold on the scope of the LDAP standard and the problems it was capable of addressing.

Since then, I’ve tried to do my homework, have read both about the DNS and LDAP directory services and have thought about how they might be applied to serve vocabularies to the caBIG grid. The following are some preliminary thoughts.

Both DNS and LDAP serve information organized in a tree of nodes, each node having a particular class and containing some number of attributes. In DNS, this tree is called the Domain Name Space, and in LDAP it is called the Directory Information Tree (DIT). Each node of the tree also has a unique name. In DNS this name is the fully-qualified domain name (FQDN), and in LDAP this is the distinguished name (DN). Highly distinguished indeed. In DNS there are not that many node classes, by far the most common being the IN, or Internet, class. By contrast, there are a very large number of LDAP node objectclasses. Each LDAP objectclass must be assigned a globally unique object identifier (OID), which looks like 1.1.3.232.23.242.1.52, i.e. a string of integers separated by periods. The OIDs fit into their own tree according to a hierarchical structure imposed on the organizations to which they are assigned (or something like that).

The attributes of a node in DNS are called resource records (RRs). The most well-known type of resource record is the A, or address, record, which for IN nodes stores an IP address associated with the FQDN of the node. Common LDAP attributes are, for example, dc (domain component), dn (distinguished name, which, as an attribute, is a relative name, or the most specific part of the full DN of the node), o (organization), cn (common name), and sn (surname).
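The shared shape of the two trees can be sketched in a few lines of Python. This is a toy model, not how either kind of server stores its data: the Node class, the example names and the 192.0.2.10 address (from a range reserved for documentation) are all assumptions made for illustration. It shows only that a node's unique name is built by walking from the node up to the root, most specific part first.

```python
class Node:
    """A named tree node carrying a dictionary of attributes."""

    def __init__(self, label, parent=None, attributes=None):
        self.label = label
        self.parent = parent
        self.attributes = attributes or {}

    def path_to_root(self):
        """Return the labels from this node up to the root, most specific first."""
        labels = []
        node = self
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return labels


def fqdn(node):
    # DNS joins labels with dots; the root's empty label yields the trailing dot.
    return ".".join(node.path_to_root())


def dn(node):
    # LDAP joins relative names with commas, most specific first.
    return ",".join(node.path_to_root())


# A DNS-like branch: root -> org -> example -> www
dns_root = Node("")
www = Node("www", Node("example", Node("org", dns_root)), {"A": ["192.0.2.10"]})

# An LDAP-like branch: dc=org -> dc=example -> ou=People -> cn=Jane Doe
people = Node("ou=People", Node("dc=example", Node("dc=org")))
jane = Node("cn=Jane Doe", people, {"cn": ["Jane Doe"], "sn": ["Doe"]})
```

With these definitions, fqdn(www) gives "www.example.org." and dn(jane) gives "cn=Jane Doe,ou=People,dc=example,dc=org".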

In DNS, a node’s class does not determine which attributes it may contain. Any node can potentially have any attribute. The class refers to the type of network about which information is provided (IN = internet, CH = chaos). Therefore, extending DNS is a matter of adding new attributes, or types of resource records. By contrast, in LDAP, a node’s objectclass determines the types of attributes that node must and may contain. Extending LDAP is more flexible than extending DNS, but it requires writing new objectclass and attribute definitions in an LDAP schema.
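The "must and may contain" rule is easy to sketch in Python. The schema below is a simplified stand-in for a real LDAP schema: the person entry lists only a subset of that objectclass's optional attributes, and cdeTerm with its attributes is entirely hypothetical, my guess at what a CDE vocabulary entry might need.

```python
# Simplified schema: objectclass -> required ("must") and optional ("may") attributes.
SCHEMA = {
    "person": {"must": {"cn", "sn"}, "may": {"telephoneNumber", "description"}},
    # Hypothetical objectclass for a Common Data Element; not a real definition.
    "cdeTerm": {"must": {"cn", "definition"}, "may": {"unitOfMeasure"}},
}


def validate(objectclass, attributes):
    """Check an entry's attribute names against its objectclass.

    Returns (missing, unexpected): required attributes that are absent,
    and attributes the objectclass does not allow, each sorted by name.
    """
    rules = SCHEMA[objectclass]
    present = set(attributes)
    missing = rules["must"] - present
    unexpected = present - rules["must"] - rules["may"]
    return sorted(missing), sorted(unexpected)
```

For example, validate("person", {"cn": "Jane Doe", "mail": "jane@example.org"}) reports sn as missing and mail as unexpected. A DNS-style node has no equivalent check, since any node may carry any record type.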

Both LDAP and DNS provide for information about different parts of the tree being served by different servers on the network. The directory tree is split into “zones” (the DNS word) or “partitions” (the LDAP word), which are subgraphs of elements that share a common ancestor plus the common ancestor (the “root”). A particular server can refer requests that it receives for parts of the directory tree outside of its zone/partition to other servers. Servers can also cache results of requests referred to other servers, as well as cache results of requests performed on their local directory. Finally, a given server can be made redundant for load-balancing and failover purposes. Synchronization between the master server and its replicas occurs via a pull model in DNS (replicas periodically poll the master for updates), whereas in LDAP either a push (the master notifies the replicas of updates) or a pull model can be used.
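Here is a toy Python sketch of the zone, referral and caching behaviour described above. DirectoryServer, its fields and the example names are all assumptions for illustration. For simplicity the server follows the referral itself rather than handing it back to the client, and cached entries never expire, which a real server would not allow.

```python
def covers(suffix, name):
    """True if name is the suffix itself or lies underneath it in the tree."""
    return name == suffix or name.endswith("." + suffix)


class DirectoryServer:
    def __init__(self, zone, records, referrals=None):
        self.zone = zone                  # suffix this server is authoritative for
        self.records = records            # name -> value, for names inside the zone
        self.referrals = referrals or {}  # suffix -> server responsible for it
        self.cache = {}                   # answers previously fetched from other servers
        self.remote_lookups = 0

    def lookup(self, name):
        if covers(self.zone, name):
            return self.records[name]     # authoritative answer (KeyError if absent)
        if name in self.cache:
            return self.cache[name]       # answered locally from the cache
        for suffix, server in self.referrals.items():
            if covers(suffix, name):
                self.remote_lookups += 1
                value = server.lookup(name)
                self.cache[name] = value
                return value
        raise KeyError(name)


site_b = DirectoryServer("b.example.", {"term.b.example.": "definition held at site B"})
site_a = DirectoryServer(
    "a.example.",
    {"term.a.example.": "definition held at site A"},
    referrals={"b.example.": site_b},
)
```

Looking up term.b.example. through site_a twice contacts site_b only once; the second answer comes from site_a's cache. That is the property that would keep CDE lookups from hammering a single vocabulary server.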

On the face of things, the two technologies seem rather equivalent. However, DNS is decidedly the more mature technology. It serves millions of users every day, there are thousands of DNS servers deployed on the network, and every computer has a command-line tool to debug DNS queries (dig or nslookup). LDAP is a standard for organization-wide directory services, but it is not in great use cross-organizationally (as far as I know). DNS was designed to provide a fairly narrow range of directory services; by contrast, LDAP is much more general.

Thus far, the two tools have largely been applied to very different problems, so it’s difficult to compare how the two might fare at the same task.

Why is BioPerl the favorite?

September 11th, 2004

BioPerl is the most successful (as far as I can tell) of the various Open Bioinformatics Foundation projects. To me this is strange, because most programmers I know (and here I am including programmers outside the field of bioinformatics) find Perl either a little distasteful or a little passé. There certainly are some saltworthy programmers who don’t feel this way, but they’re not the majority.

I think this has to do with a) the users of these code libraries, and b) the problems these types of code libraries best solve in general. This is a wild guess, because I certainly don’t know a good cross-section of users of these libraries, nor do I have a solid handle on what sorts of problems they’re used to solve. Nevertheless, I’ll be wild and continue this train of thought.

The users are typically people who are not programmers by profession. They came to programming from biology or some other field. Their shtick is finding scientifically interesting patterns in masses of biological data, and they know how to wire together a bunch of scripts to do this. They don’t have to create stable, production-quality software. They just have to get the right answer. Perl is great for these people, because it’s basically a procedural language (yeah, bless me a hash all you want, Perl is really used primarily as a procedural language), so the programming model is easy for someone with less programming experience, and it lends itself very well to creating a library of reusable scripts.

Java, on the other hand, is a language better suited to production-quality applications. It requires a bigger upfront investment than Perl, essentially because there’s more structure, and that structure pays off down the road for larger, long-lived applications. But these applications are not typically a bunch of scripts strung together (otherwise you really shouldn’t be using Java). The Java libraries that do thrive are usually one of several competing solutions to a smaller, focused problem, not a huge mass of single solutions to a bunch of loosely related problems. BioJava fits more in the latter category.

So, why is BioRuby so far behind? Ruby is a language for programming geeks. People with less programming experience tend not to appreciate Ruby’s pure object-orientedness, its blocks/closures, etc. These features are really more confusing than valuable for your typical bioinformatician. It’s also a relative newcomer to the language scene, and a lot of the documentation is in Japanese.

Just in case it needs to be said, there’s nothing wrong with being a bioinformatician with less programming experience. These people are a whole lot better at science than your average Java business software development guru. Everyone has their strengths and their weaknesses.

In a glaring omission, I didn’t even mention BioPython, perhaps the most puzzling second-runner of them all, since it offers a nice compromise between Perl and Ruby, although it isn’t so far behind after all.