Refactoring bio-ontologies

Recently I came across a reference to this article from Nature Biotechnology about the insufficiencies of the current implementations of biological ontologies. The article points out that most if not all current ontologies used or being developed for computational biology have serious design flaws, flaws that hamper the use of these ontologies in computational work.

I have heard this before, and have seen some evidence of the problem myself. There seem to be two camps. In one camp you have the people who look at ontologies mostly in light of immediate practicality: they want to start using these things today, and don’t care if they’re perfectly designed, or even designed well (I’m exaggerating, I’m sure). The other, more purist camp finds this lack of concern for good design vexing, and worries about the long-term usability of the junk being pumped out by the first camp.

I’ll go ahead and take the highly noncontroversial position that both of these extremes are, well, a little extreme, and that we need to look for a compromise somewhere in the middle. The discussion reminds me to some degree of the somewhat older discussion concerning reusable design in object-oriented software. In that discussion, too, you had roughly two camps, one skeptical of the value of spending too much time on design, and the other decrying the horrible, non-reusable software hacked out by the former. This discussion has been ongoing for quite a while now, and there are a number of lessons that I think we can apply to the ontology design debate.

First, designing something right the first time never happens, no matter how much effort is spent. I think it’s reasonable to be skeptical of groups claiming to be working on designing the ontology that will capture everything and will solve everyone’s problems. Even if these groups actually complete what they set out to do, no doubt there will be issues the group will not have considered sufficiently.

This does not mean that there is no value in design, however. Point two: bad design does indeed cause headaches and waste lots of time, principally for those who come later and try to use or to evolve a bad design for new purposes.

Point three: Bad design happens. Most object-oriented code written today is not designed well. This is a comment on the level of training most software developers get and on how well their organizations help them improve, as well as a recognition of the inevitability of human error. I think we can expect that the same mere mortals (I include myself) will be building our biological ontologies.

Four, “good design” is not an absolute. There is ambiguity about what constitutes good object-oriented design. That is, while there are some mathematical principles to help us along, it is certainly possible to follow those principles exactly and still produce a horrible design. It is also possible to produce a beautiful design and break a rule or two. Style plays a part to some degree, not just math. Although mathematical principles may play a greater part governing the design of an ontology, I suspect that style and approach still have a role. Thus, two ontologies may not mesh well together because of differences of style or approach.

I’d say we need two things to help us design ontologies better: design standards and refactoring. Design standards (for software they’re often called coding standards) itemize the recognized areas of stylistic variation, and choose, arbitrarily, one way of handling each variation. The goal is to make code written by one person look almost identical to code written by another person, so that every person on the team can work on the code without having to modify these trivial variations to their liking each time. Readability and workability are improved, even if not everyone on the team agrees absolutely with each choice. Agreeing on common design standards for ontologies may be more difficult, however, because of the larger community that must agree.

Refactoring is the practice of altering software code in a way that does not change its function, but (hopefully) improves its design. Refactoring should make code easier to evolve and to use. Even though you know you won’t get it right the first time, you commit to refine the design each time you get a chance. Some call this practice “continuous design”. Rather than discounting the importance of design, it places design in a role rightfully central to the everyday work of each software developer.
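To make "does not change its function" concrete, here is a minimal sketch in Python. The function names and the GC-content report are my own assumptions, picked only for illustration. The first version tangles the counting and the formatting together; the second extracts the calculation into a helper. Both return the same result for the same input, which is the whole point.

```python
def report_before(sequences):
    """Original version: counting and formatting are tangled together."""
    lines = []
    for seq in sequences:
        gc = 0
        for base in seq:
            if base in "GC":
                gc += 1
        lines.append(f"{seq}: {gc / len(seq):.2f}")
    return lines


def gc_fraction(seq):
    """Extracted helper (hypothetical name): fraction of G or C bases."""
    return sum(base in "GC" for base in seq) / len(seq)


def report_after(sequences):
    """Refactored version: same output, but the calculation is reusable."""
    return [f"{seq}: {gc_fraction(seq):.2f}" for seq in sequences]


# The refactoring is only valid if behavior is unchanged.
sample = ["GATTACA", "GGCC"]
assert report_before(sample) == report_after(sample)
```

The final assertion is the kind of check that usually accompanies a refactoring: it does not prove the design is better, only that nothing observable changed.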

It helps developers refactor better if they have guidelines indicating when they might want to refactor, when design could be improved and how. To this end, a number of people (notably, Martin Fowler) cataloged a list of refactorings and their associated “code smells” (a good example of the typically casual vernacular of the software development community; a casualness that I think sometimes, unfortunately, puts off some scientists who take themselves too seriously). A “code smell” is a description of a (figuratively malodorous) symptom of poor design, which one might notice when looking at a particular piece of code. The refactoring catalog then describes the refactorings, or step-by-step procedures, a developer might undertake to remove the smell.

There is much more to the practice of software refactoring, but I think the general idea applies quite well to ontologies. It would be nice if, instead of having to catalog a list of specific problems with a particular ontology (as the authors of the above-mentioned article did), people could just say “oh yes, ontology X has symptom A and symptom B. You might want to consider refactorings 1, 5 or 6 to resolve those problems.” Not only would we then have a language to describe how to improve the designs of these things, I think it would also be much easier to teach people how to create better designs. Finally, successful ontologies could be refactored (improving their design, making them more interoperable with others and more amenable to computational work) without having to get things right the first time.
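As a rough sketch of what such a shared catalog might look like in practice, here is a small Python lookup. Everything in it is a placeholder: the symptom labels "A", "B" and "C" and the refactoring numbers are borrowed from the hypothetical quote above, not from any real list of ontology design problems, and `suggest_refactorings` is a name I made up.

```python
# Hypothetical catalog: symptom label -> numbers of refactorings that may help.
# The labels and numbers are placeholders, not real ontology design rules.
CATALOG = {
    "A": [1, 5],
    "B": [5, 6],
    "C": [2],
}


def suggest_refactorings(symptoms, catalog=CATALOG):
    """Return the sorted, de-duplicated refactorings for the given symptoms."""
    suggestions = set()
    for symptom in symptoms:
        # Unknown symptoms contribute nothing rather than raising an error.
        suggestions.update(catalog.get(symptom, []))
    return sorted(suggestions)


# "Ontology X has symptom A and symptom B."
print(suggest_refactorings(["A", "B"]))  # prints [1, 5, 6]
```

The lookup itself is trivial. The hard part, and the part worth doing as a community, would be agreeing on what goes into the catalog.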

The truth is that many ontologies will be created, and not all of them will survive. By developing a set of refactorings and design symptoms for ontologies, we can help strengthen the valuable ones and understand better when to discard those past the point of saving.

Another comment about all of this design purism: I’ve also heard the opinion that this desire to have one mother-of-all-things ontology is a little ridiculous, because there are honest differences of opinion among scientists about how to categorize some biological concepts, and some concepts are simply ambiguous (please correct me if I’m off base here). These folks suggest that we find a way to allow multiple, incompatible ontologies to coexist, providing correspondences where necessary and possible via RDF. I’d love to hear more about this.
