Horses for courses

October 28th, 2005

(Post updated 10/27/05; I didn’t quite like the previous version.) This post is in response to a comment I received from Greg Tyrelle. It reminded me of posts on Flags and Lollipops and Propeller Twist that I read several weeks back and had wanted to comment on. I think a longer post is more appropriate than a short reply to the comment.

When I came across Stew’s Flags and Lollipops post myself several weeks ago, I enjoyed reading it. My favorite line is the second to last one: “we should be able to accept that no single software methodology or mindset necessarily suits all situations.” I completely agree, and I’m glad to see a bioinformaticist reflect on what approach might be most appropriate to a common form of bioinformatics programming — what we might call paper programming, in which only one programmer ever sees/edits the code, and the code gets thrown away once a paper is published.

However, I don’t think it follows (as Greg implies some may believe) that such bioinformaticians can therefore blithely shun discussion of software methods (indeed, Stew himself patently is not a shunner). Rather, they should be aware that they have explicitly chosen one method over the others available to them, and they should be able to give decent reasons for that choice.

In addition, paper programming isn’t the only kind of bioinformatics software under development today (as Fabrice Jossinet points out in his response to the Flags and Lollipops post). A couple of years ago I came across a presentation by Maury Leysens that attempted to categorize bioinformatics software projects by the method most appropriate for their management. I thought this was a great effort, and it makes the same very important point: we should not approach every bioinformatics project with the same methods and expect good results. Our work is not homogeneous.

Choice of approach, of method, determines how we go about our daily work. In a very fundamental way this choice affects both our results and our job satisfaction. For this reason I find it strange (as I’ve said before) that in our community this choice receives so little discussion.

You see mistakes being made with both too little and too much structure. On the one hand, you see bioinformaticians and biologists slogging away at their informatics with no thought given to software practices, and everyone suffering for it. On the other hand, you have bioinformaticians who insist that their ISO-9001-approved, iron-clad, high-ceremony, bend-over-backwards-quadruple-backflip methods are the only way anyone will ever produce an acceptable bioinformatics tool. These folks can miss out on the opportunities that occur when people have more room to experiment and have a bit more fun.

More important still, there are groups out there that are figuring out better ways to approach building biomedical software. Yet we don’t often hear about these improvements except by accident. As I’ve said before, I think it would be valuable to our community if we started talking about software methods more often, so we can learn both from the broader software community (where these things are discussed commonly) and from each other’s experience.

I think that this discussion is becoming increasingly important to success in biomedical informatics. The types of applications we are asked to build are becoming more heterogeneous. More and more we are being asked to produce systems that outlive the publication of a paper. Informatics projects require more attention to the selection of appropriate software development methods than they have in the past.

Enough generalities. I have been working on a post about user collaboration in biomedical software that I will finalize in the next day or two (or three . . .), so please look for that if you want to hear something more specific.

(Footnote: regarding software reuse (the subject of Stew’s original post), Code Craft recently had a nice post about it, “What the space shuttle taught us about reuse”. Just doing my part to bring software people and bioinformatics people together :)

Refactoring bio-ontologies

October 7th, 2005

Recently I came across a reference to this article from Nature Biotechnology about the insufficiencies of the current implementations of biological ontologies. The article points out that most if not all current ontologies used or being developed for computational biology have serious design flaws, flaws that hamper the use of these ontologies in computational work.

I have heard this before, and have seen some evidence of the problem myself. There seem to be two camps. In one camp you have the people who look at ontologies mostly in light of immediate practicality: they want to start using these things today, and don’t care if they’re perfectly designed, or even designed well (I’m exaggerating, I’m sure). The other, more purist camp finds this lack of concern for good design vexing, and worries about the long-term usability of the junk being pumped out by the first camp.

I’ll go ahead and take the highly noncontroversial position that both of these extremes are, well, a little extreme, and that we need to look for a compromise somewhere in the middle. The discussion reminds me to some degree of the somewhat older discussion concerning reusable design in object-oriented software. In that discussion, too, you had roughly two camps, one skeptical of the value of spending too much time on design, and the other decrying the horrible, non-reusable software hacked out by the former. This discussion has been ongoing for quite a while now, and there are a number of lessons that I think we can apply to the ontology design debate.

First, designing something right the first time never happens, no matter how much effort is spent. I think it’s reasonable to be skeptical of groups claiming to be working on designing the ontology that will capture everything and will solve everyone’s problems. Even if these groups actually complete what they set out to do, no doubt there will be issues the group will not have considered sufficiently.

This does not mean that there is no value in design, however. Point two: bad design does indeed cause headaches and waste lots of time, principally for those who come later and try to use or to evolve a bad design for new purposes.

Point three: Bad design happens. Most object-oriented code written today is not designed well. This is a comment on the level of training most software developers get and on how well their organizations help them improve, as well as a recognition of the inevitability of human error. I think we can expect that the same mere mortals (I include myself) will be building our biological ontologies.

Four, “good design” is not an absolute. There is ambiguity about what constitutes good object-oriented design. That is, while there are some mathematical principles to help us along, it is certainly possible to follow those principles exactly and still produce a horrible design. It is also possible to produce a beautiful design and break a rule or two. Style plays a part to some degree, not just math. Although mathematical principles may play a greater part governing the design of an ontology, I suspect that style and approach still have a role. Thus, two ontologies may not mesh well together because of differences of style or approach.

I’d say we need two things to help us design ontologies better: design standards and refactoring. Design standards (for software they’re often called coding standards) itemize the recognized areas of stylistic variation, and choose, arbitrarily, one way of handling each variation. The goal is to make code written by one person look almost identical to code written by another person, so that every person on the team can work on the code without having to modify these trivial variations to their liking each time. Readability and workability are improved, even if not everyone on the team agrees absolutely with each choice. Agreeing on common design standards for ontologies may be more difficult, however, because of the larger community that must agree.

Refactoring is the practice of altering software code in a way that does not change its function, but (hopefully) improves its design. Refactoring should make code easier to evolve and to use. Even though you know you won’t get it right the first time, you commit to refine the design each time you get a chance. Some call this practice “continuous design”. Rather than discounting the importance of design, it places design in a role rightfully central to the everyday work of each software developer.
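To make the idea concrete, here is a minimal sketch of one small refactoring in Python. The functions, their names, and the sample data are all invented for illustration. The "before" version tangles parsing and arithmetic in one function; the "after" version extracts each concern into its own function. The point is only that the output stays the same while the pieces become easier to reuse and to change.

```python
# Before: one function both parses FASTA-like text and computes GC content.
def gc_report_before(fasta_text):
    seqs = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif line and name is not None:
            seqs[name] += line.upper()
    report = {}
    for n, s in seqs.items():
        if s:
            report[n] = (s.count("G") + s.count("C")) / len(s)
    return report


# After: the same behavior, split into two small, separately usable steps.
def parse_fasta(fasta_text):
    """Return a dict mapping each record name to its upper-cased sequence."""
    seqs = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif line and name is not None:
            seqs[name] += line.upper()
    return seqs


def gc_content(seq):
    """Return the fraction of G and C characters in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_report_after(fasta_text):
    return {n: gc_content(s) for n, s in parse_fasta(fasta_text).items() if s}


# Invented sample input; a refactoring must leave the result unchanged.
SAMPLE = ">a\nGGCC\n>b\natgc\n"
assert gc_report_before(SAMPLE) == gc_report_after(SAMPLE)
```

The final assertion is the important part: a check that the old and new versions agree is what lets you keep refining the design without fear of changing what the code does.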

It helps developers refactor better if they have guidelines indicating when they might want to refactor, when design could be improved and how. To this end, a number of people (notably, Martin Fowler) cataloged a list of refactorings and their associated “code smells” (a good example of the typically casual vernacular of the software development community; a casualness that I think sometimes, unfortunately, puts off some scientists who take themselves too seriously). A “code smell” is a description of a (figuratively malodorous) symptom of poor design, which one might notice when looking at a particular piece of code. The refactoring catalog then describes the refactorings, or step-by-step procedures, a developer might undertake to remove the smell.

There is much more to the practice of software refactoring, but I think the general idea applies quite well to ontologies. It would be nice if, instead of having to catalog a list of specific problems with a particular ontology (as the authors of the above-mentioned article did), people could just say “oh yes, ontology X has symptom A and symptom B. You might want to consider refactorings 1, 5 or 6 to resolve those problems.” Not only would we then have a language to describe how to improve the designs of these things, I think it would also be much easier to teach people how to create better designs. Finally, successful ontologies could be refactored–improving their design, making them more interoperable with others and more amenable to computational work–without having to get things right the first time.
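As a rough sketch of what such a catalog might look like in practice, the Python below checks a toy ontology for two symptoms and looks up suggested refactorings. Everything here is an assumption made for illustration: the toy terms, the two symptom names ("circular is_a" and "missing definition"), and the suggested fixes are invented, not taken from the article or from any existing ontology tool.

```python
def ancestors(term, is_a):
    """Collect every term reachable from `term` by following is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def detect_symptoms(is_a, definitions):
    """Return {term: [symptom, ...]} for terms showing a cataloged symptom."""
    report = {}
    for term in is_a:
        found = []
        if term in ancestors(term, is_a):
            found.append("circular is_a")
        if not definitions.get(term, "").strip():
            found.append("missing definition")
        if found:
            report[term] = found
    return report


# Hypothetical catalog: symptom name -> refactorings worth considering.
REFACTORINGS = {
    "circular is_a": ["remove or redirect one is_a link in the cycle"],
    "missing definition": ["add a textual definition for the term"],
}

# Invented toy ontology: term -> list of is_a parents, plus definitions.
IS_A = {"cell": [], "neuron": ["cell"], "part": ["whole"], "whole": ["part"]}
DEFINITIONS = {
    "cell": "The basic unit of life.",
    "neuron": "",
    "part": "A component of something.",
    "whole": "Something made of parts.",
}

for term, symptoms in detect_symptoms(IS_A, DEFINITIONS).items():
    for symptom in symptoms:
        print(term, "has symptom:", symptom, "-> consider:", REFACTORINGS[symptom])
```

A real catalog would need symptoms agreed on by the community; the value of the shape above is only that symptoms get names, and each name points at a known remedy.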

The truth is that many ontologies will be created, and not all of them will survive. By developing a set of refactorings and design symptoms for ontologies, we can help strengthen the valuable ones and understand better when to discard those past the point of saving.

Another comment about all of this design purism: I’ve also heard the opinion that this desire to have one mother-of-all-things ontology is a little ridiculous, because there are honest differences of opinion among scientists about how to categorize some biological concepts, and some concepts are simply ambiguous (please correct me if I’m off base here). These folks suggest that we find a way to allow multiple, incompatible ontologies to coexist, providing correspondences where necessary and possible via RDF. I’d love to hear more about this.
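For readers wondering what such correspondences could look like, here is a minimal sketch in plain Python, with no RDF library involved. Correspondences are stored as (subject, predicate, object) triples, which is the basic shape of RDF statements. The predicate labels are borrowed from the SKOS mapping vocabulary (`skos:exactMatch`, `skos:closeMatch`) purely as an illustration; the ontology prefixes, term names, and the `translate` helper are hypothetical.

```python
EXACT = "skos:exactMatch"   # the two terms can be used interchangeably
CLOSE = "skos:closeMatch"   # the two terms are similar but not identical

# Invented correspondences between two hypothetical ontologies, A and B.
MAPPINGS = [
    ("ontoA:Neuron", EXACT, "ontoB:NerveCell"),
    ("ontoA:Gene", CLOSE, "ontoB:Locus"),
]


def translate(term, triples, allowed=(EXACT,)):
    """Return the terms linked to `term` by one of the allowed predicates.

    Both predicates are treated as symmetric, so a mapping can be followed
    in either direction.
    """
    matches = set()
    for subject, predicate, obj in triples:
        if predicate not in allowed:
            continue
        if subject == term:
            matches.add(obj)
        elif obj == term:
            matches.add(subject)
    return sorted(matches)


print(translate("ontoA:Neuron", MAPPINGS))
print(translate("ontoA:Gene", MAPPINGS, allowed=(EXACT, CLOSE)))
```

Keeping the strength of each correspondence explicit lets two ontologies disagree where scientists honestly disagree, while still allowing a tool to cross between them where the mapping is trusted.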