BRIITE 2005 La Jolla

December 9th, 2005

Last month I attended the BRIITE meeting for the first time. As its website says, the meeting’s mission is to:

  • Establish personal contacts, bringing together those responsible for research computing activities at biomedical research institutions
  • Identify and document common problems and interests
  • Seek opportunities for partnership / consortium activities
  • Identify common issues that should be brought to attention of home institutions, government and other funding agencies

One of the things I really liked about BRIITE was its focus on stimulating offline discussions. I went to the meeting hoping to find out if there were others in the biomedical informatics community who cared about software development issues. The focus of the meeting was “IT Support for Multi-Institution Collaborative Research”. We heard talks on federated identity management, Globus, GridShib, BIRN, and, slightly off-topic but still pretty neat, the Research Channel. So, the meeting was mostly about distributed security, and many attending were not primarily software development people.

Due to the focus on offline discussions, attendees are encouraged to suggest topics of interest, and then others sign up to discuss these topics as a group (a conference practice I’ve heard called a “Birds of a Feather” session). I proposed software development practices as a topic, and a small group of us got together for a lively two-hour discussion. I will summarize what we talked about here.

People were interested in the topic for a variety of reasons. One person wanted to know how to introduce better development practices to his group. Another wanted input on how to manage many small projects at once (a common problem in biomedical informatics). Another was plagued with the problem of funding shared informatics resources at a biomedical research institution (people clearly need these resources, but no one wants to or can pay for them). Another wanted to learn more about agile software methods. Finally, I wanted to talk about how better to promote awareness and discussion of software practices in the biomedical informatics community.

In the interest of brevity I’ll just list bullet points rather than go on and on about each item.

Funding IT

While not strictly a software development practice topic, anyone who wants to assemble a decent-sized software team at a biomedical research institution runs up against the problem of funding it. Most of the helpful tips here came from Charles Donnelly of the Jackson Laboratory.

  • If these resources don’t exist, who provides seed funding to get them started? Investigators typically cannot fund these things from their research grants. Institutional funding and commitment is necessary. Where it exists, people have seen some success in developing shared informatics resources (e.g. the Jackson Laboratory).
  • Funding improves if core informatics staff reviews research grants to help make them more realistic about informatics needs and funding. Encouraging awareness among scientific investigators and administrators is key.
  • Once you have institutional funding, keep track of the time you spend on various grants. Calculate the percentage of your total work this amounts to, and tell the PIs on those grants. This will help them understand how much their informatics costs, even if they aren’t paying for it yet.
  • Generally, PIs that have done some computing in the past are more forward-looking, so these are good people to start with (no-brainer).
  • Funding informatics from research grants is difficult, because all budgeting is done ahead of time. Often, however, the informatics needs change over the course of the scientific project. How do you come up with specifics like $ and number of FTEs before a project begins? (The answer, it seems to me, is you do what scientists do. You make your best guess, and then you adjust what you produce based on what you have the money to produce once you learn what is possible during the course of the project.)
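The time-tracking tip above comes down to simple arithmetic, and a minimal sketch may make it concrete. Everything here is an assumption for illustration: the grant labels, the hours, and the function name `effort_share_by_grant` are hypothetical, and a real group would pull the log from whatever timesheet or issue tracker it already uses.

```python
from collections import defaultdict


def effort_share_by_grant(time_log):
    """Return each grant's share of total logged hours, as a percentage.

    time_log is an iterable of (grant_label, hours) pairs; the labels are
    whatever the group uses to tag its work (hypothetical in this sketch).
    """
    hours = defaultdict(float)
    for grant, logged_hours in time_log:
        hours[grant] += logged_hours

    total = sum(hours.values())
    if total == 0:
        return {}  # nothing logged yet, so no shares to report

    # Round to one decimal place, which is plenty for a note to a PI.
    return {grant: round(100.0 * h / total, 1) for grant, h in hours.items()}


# Made-up example log: two grants plus unfunded core work.
time_log = [("grant_A", 12.0), ("grant_B", 6.0), ("grant_A", 8.0), ("core", 14.0)]
shares = effort_share_by_grant(time_log)

for grant, pct in sorted(shares.items()):
    print(f"{grant}: {pct}% of informatics effort")
```

Reporting a percentage rather than raw hours keeps the conversation with a PI about relative cost, which is the point of the tip.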

Project Management

Mix of suggestions and problems here . . .

  • One way to scale your capabilities effectively is to develop shared resources used by several groups. Supporting shared resources can be a challenge, however, because PIs may want results formatted their own way, not in a common format.
  • Before embarking on a project, draw up a project charter. This document establishes the level of involvement of key people (on both the informatics and science side) and the high-level outline of the work planned. The process of preparing this document will let you know if science people are invested enough in the project, or if there are reasons to worry. One cannot overestimate the importance of buy-in and personal investment in the success of an informatics endeavor by the science leaders on the project.
  • Promote awareness of software development: Jax has a software lifecycle process approved by the institution and posted up everywhere.

Requirements/Communication

  • Some used a detailed requirements gathering process (at Jax they draw up 150-page documents) to firm up what would be developed.
  • Others used agile approaches, which are interactive with the scientists throughout the life of the project and make it a specific point to allow the requirements to evolve along with the scientists’ understanding of their needs.
  • All agreed that requirements gathering in biomedical software has to be an ongoing, interactive endeavor - biologists sometimes want to wait until features are in production before they give feedback, because they want to see something working and tinker with it.
  • Underestimated opportunity for software people: helping biologists with their ad-hoc solutions (Access, Excel, FileMaker Pro) — biologists will always use Excel, so is there some way we can help them use Excel better? Can we leverage their familiarity with these technologies? A software development background teaches you to look askance on “low-tech” solutions like these, but in this domain there is something to be said for them.
  • Starting with the front end (UI) and working back helps user interaction — you can show a prototype of an interface and get better feedback earlier.
  • Best ideas don’t always come from biologists: we are doing our job best when we are helping scientists understand what is possible with software, and helping them focus on their goals and how software could help achieve them.

Looking Forward

We talked about possible content for a meeting or conference on biomedical and bioinformatics software development. People showed interest in the following.

  • Documenting typical staffing models — help us all understand our options for organizing our work and getting it funded
  • Questionnaire — find out what people are doing, try to get a picture of the current state of affairs
  • Present experience reports: What working examples are out there?
  • People preferred a workshop to a conference: a hands-on experience, with plenty of short talks that don’t give answers but raise questions, followed by open discussions.
  • People would like to see defined milestones for the meeting.
  • Tutorials on some subjects (e.g. project management, quality assurance and testing, good development practices) would be welcome
  • Q&A sessions - what have you done that can benefit others, what are you doing?

Horses for courses

October 28th, 2005

(Post updated 10/27/05; I didn’t quite like the previous version.) This post is in response to a comment I received from Greg Tyrelle, which reminded me of posts I read on Flags and Lollipops and Propeller Twist several weeks back, and wanted to comment on. I think a longer post is more appropriate than a shorter reply to the comment.

When I came across Stew’s Flags and Lollipops post myself several weeks ago, I enjoyed reading it. My favorite line is the second to last one: “we should be able to accept that no single software methodology or mindset necessarily suits all situations.” I completely agree, and I’m glad to see a bioinformaticist reflect on what approach might be most appropriate to a common form of bioinformatics programming — what we might call paper programming, in which only one programmer ever sees/edits the code, and the code gets thrown away once a paper is published.

However, I don’t think it follows (as Greg implies some may believe) that such bioinformaticians can therefore blithely shun discussion of software methods (indeed Stew himself patently is not a shunner). Rather they should be aware that they have explicitly chosen one method over the other methods available to them, and they should be able to provide decent reasons for choosing so.

In addition, paper programming isn’t the only kind of bioinformatics software under development today (as Fabrice Jossinet points out in his response to the Flags and Lollipops post). A couple years ago I came across a presentation by Maury Leysens that attempted to categorize bioinformatics software projects by the method most appropriate for their management. I thought this was a great effort, and it makes the same very important point that we should not approach every bioinformatics project with the same methods and expect good results. Our work is not homogeneous.

Choice of approach, of method, determines how we go about our daily work. In a very fundamental way this choice affects both our results and our job satisfaction. For this reason I find it strange (as I’ve said before) that in our community this choice receives so little discussion.

You see mistakes being made with both too little and too much structure. On the one hand, you see bioinformaticians and biologists slogging away at their informatics with no thought given to software practices, and everyone suffering for it. On the other hand, you have bioinformaticians who insist that their ISO-9001-approved, iron-clad, high-ceremony, bend-over-backwards-quadruple-backflip methods are the only way anyone will ever produce an acceptable bioinformatics tool. These folks can miss out on the opportunities that occur when people have more room to experiment and have a bit more fun.

More important still, there are groups out there that are figuring out better ways to approach building biomedical software. Yet we don’t often hear about these improvements except by accident. As I’ve said before, I think it would be valuable to our community if we started talking about software methods more often, so we can learn both from the broader software community (where these things are discussed commonly) and from each other’s experience.

I think that this discussion is becoming increasingly important to success in biomedical informatics. The types of applications we are asked to build are becoming more heterogeneous. More and more we are being asked to produce systems that outlive the publication of a paper. Informatics projects require more attention to the selection of appropriate software development methods than they have in the past.

Enough generalities. I have been working on a post about user collaboration in biomedical software that I will finalize in the next day or two (or three . . .), so please look for that if you want to hear something more specific.

(Footnote: regarding software reuse (the subject of Stew’s original post), there was recently a nice post on Code Craft, “What the space shuttle taught us about reuse”. Just doing my part to bring software people and bioinformatics people together : )

Refactoring bio-ontologies

October 7th, 2005

Recently I came across a reference to this article from Nature Biotechnology about the insufficiencies of the current implementations of biological ontologies. The article points out that most if not all current ontologies used or being developed for computational biology have serious design flaws, flaws that hamper the use of these ontologies in computational work.

I have heard this before, and have seen some evidence of the problem myself. There seem to be two camps. In one camp you have the people who look at ontologies mostly in light of immediate practicality: they want to start using these things today, and don’t care if they’re perfectly designed, or even designed well (I’m exaggerating I’m sure). The other, more purist camp finds this lack of concern for good design vexing, and worries about the long-term usability of the junk being pumped out by the first camp.

I’ll go ahead and take the highly noncontroversial position that both of these extremes are, well, a little extreme, and that we need to look for a compromise somewhere in the middle. The discussion reminds me to some degree of the somewhat older discussion concerning reusable design in object-oriented software. In that discussion, too, you had roughly two camps, one skeptical of the value of spending too much time on design, and the other decrying the horrible, non-reusable software hacked out by the former. This discussion has been ongoing for quite a while now, and there are a number of lessons that I think we can apply to the ontology design debate.

First, designing something right the first time never happens, no matter how much effort is spent. I think it’s reasonable to be skeptical of groups claiming to be working on designing the ontology that will capture everything and will solve everyone’s problems. Even if these groups actually complete what they set out to do, no doubt there will be issues the group will not have considered sufficiently.

This does not mean that there is no value in design, however. Point two: bad design does indeed cause headaches and waste lots of time, principally for those who come later and try to use or to evolve a bad design for new purposes.

Point three: Bad design happens. Most object-oriented code written today is not designed well. This is a comment on the level of training most software developers get and on how well their organizations help them improve, as well as a recognition of the inevitability of human error. I think we can expect that the same mere mortals (I include myself) will be building our biological ontologies.

Four, “good design” is not an absolute. There is ambiguity about what constitutes good object-oriented design. That is, while there are some mathematical principles to help us along, it is certainly possible to follow those principles exactly and still produce a horrible design. It is also possible to produce a beautiful design and break a rule or two. Style plays a part to some degree, not just math. Although mathematical principles may play a greater part governing the design of an ontology, I suspect that style and approach still have a role. Thus, two ontologies may not mesh well together because of differences of style or approach.

I’d say we need two things to help us design ontologies better: design standards and refactoring. Design standards (for software they’re often called coding standards) itemize the recognized areas of stylistic variation, and choose, arbitrarily, one way of handling each variation. The goal is to make code written by one person look almost identical to code written by another person, so that every person on the team can work on the code without having to modify these trivial variations to their liking each time. Readability and workability are improved, even if not everyone on the team agrees absolutely with each choice. Agreeing on common design standards for ontologies may be more difficult, however, because of the larger community that must agree.

Refactoring is the practice of altering software code in a way that does not change its function, but (hopefully) improves its design. Refactoring should make code easier to evolve and to use. Even though you know you won’t get it right the first time, you commit to refine the design each time you get a chance. Some call this practice “continuous design”. Rather than discounting the importance of design, it places design in a role rightfully central to the everyday work of each software developer.

It helps developers refactor better if they have guidelines indicating when they might want to refactor, when design could be improved and how. To this end, a number of people (notably, Martin Fowler) cataloged a list of refactorings and their associated “code smells” (a good example of the typically casual vernacular of the software development community; a casualness that I think sometimes, unfortunately, puts off some scientists who take themselves too seriously). A “code smell” is a description of a (figuratively malodorous) symptom of poor design, which one might notice when looking at a particular piece of code. The refactoring catalog then describes the refactorings, or step-by-step procedures, a developer might undertake to remove the smell.
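To make the idea concrete, here is a minimal sketch of one smell and one refactoring from Fowler’s catalog: the “Duplicated Code” smell, removed by extracting a function. The sequence-report functions are made up for illustration and are not taken from any real tool. The point is only that behavior stays the same while the design improves.

```python
# Before: the "Duplicated Code" smell. The same GC-content arithmetic is
# written out twice, so any fix or change has to be made in two places.
def report_before(seq_a, seq_b):
    gc_a = (seq_a.count("G") + seq_a.count("C")) / len(seq_a)
    gc_b = (seq_b.count("G") + seq_b.count("C")) / len(seq_b)
    return f"a={gc_a:.2f} b={gc_b:.2f}"


# After: the repeated arithmetic is extracted into one named function.
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (seq must be non-empty)."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def report_after(seq_a, seq_b):
    return f"a={gc_fraction(seq_a):.2f} b={gc_fraction(seq_b):.2f}"


# A refactoring must not change behavior, so both versions should agree.
before = report_before("GGCC", "ATGC")
after = report_after("GGCC", "ATGC")
print(before)
print(after)
```

The check at the end stands in for the automated tests that usually accompany refactoring: they are what let you change the design with some confidence that the function has not changed.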

There is much more to the practice of software refactoring, but I think the general idea applies quite well to ontologies. It would be nice if, instead of having to catalog a list of specific problems with a particular ontology (as the authors of the above-mentioned article did) people could just say “oh yes, ontology X has symptom A and symptom B. You might want to consider refactorings 1, 5 or 6 to resolve those problems.” Not only would we then have a language to describe how to improve the designs of these things, I think it would also be much easier to teach people how to create better designs. Finally, successful ontologies could be refactored–improving their design, making them more interoperable with others and more amenable to computational work–without having to get things right the first time.

The truth is that many ontologies will be created, and not all of them will survive. By developing a set of refactorings and design symptoms for ontologies, we can help strengthen the valuable ones and understand better when to discard those past the point of saving.

Another comment about all of this design purism: I’ve also heard the opinion that this desire to have one mother-of-all-things ontology is a little ridiculous, because there are honest differences of opinion among scientists about how to categorize some biological concepts, and some concepts are simply ambiguous (please correct me if I’m off base here). These folks suggest that we find a way to allow multiple, incompatible ontologies to coexist, providing correspondences where necessary and possible via RDF. I’d love to hear more about this.

Hiatus breaker

September 13th, 2005

Alright, it’s been a while since I’ve posted. I’ve been up to my ears in a number of things. One, a group of six of us biomedical informatics people from across the nation (David Kane, Mike McCormick, Ethan Cerami, Karl Kuhlmann, Jeff Byrd and I) finished the first draft of a paper on agile methods in biomedical software development. We’re currently shopping it around to different journals, which thus far has been about what I expected, that is, no one has any idea what to do with us. As I’ve noted before, discussion of these matters is nearly absent in our community, at least in public. Hopefully our paper finds a home, both so that the effort was worth it and so that, maybe, it helps start a conversation.

Speaking of conversation, I’ll be attending the BRIITE conference in San Diego this November, at which I will try to propose a breakout session on software development practices, even though it will be a bit off topic.

I have also been working with caBIG, which has been a great opportunity to get out and meet other biomedical informatics people, learn about grid technology, etc. Shortly I’ll be co-leading the caBIG Architecture Workspace Best Practices SIG, which should be fun.

In other paper news, I have been writing a paper with Sean Martin and Ted Liefeld on the impact of LSID on biomedical informatics data. Everything I know about LSIDs I learned from Sean, Bob Robbins, Ted and others during many conversations as part of the caBIG Identifiers SIG. I never knew identifiers could be such a thorny problem. I guess the true source of the thorniness is the distributed nature of the data being identified. That would make a good blog entry some day, actually.

During the latter two quarters of the 2004-2005 academic year at Northwestern, I took/sat in on several business classes at the Kellogg school. I had always harbored the prejudice that business school classes had no content, and were all about networking. However, I was happily surprised to find out that I was just being small-minded. Although I don’t think I will ever pay the big bucks/spend the big chunk of time for the MBA, being exposed to the issues discussed in those classes was an important learning experience for me.

Finally, I went to the Agile 2005 conference this year and presented an experience report on tracking progress and estimation in software projects. We’re still working on porting the tool we’ve developed for issue tracking from ColdFusion (blech) to Ruby/Rails, but when it’s ready you can check it out at http://rhythm.sourceforge.net. This year I went with my two immediate coworkers, John and Rhett, and it was a great experience to go with them, to hear about their experiences there, and reinforce what we’ve been trying to do for the last few years. I also got to catch up with a couple people from ThoughtWorks days.

Bioinformatics Project Management

November 14th, 2004

So far bioinformaticians have, publicly at least, focused mainly on science. Any given conference or journal is full of papers about algorithms and newly available software tools. Conspicuously absent, at least to me, is any discussion of the software development process used to turn those algorithms into those working software tools. One might argue that such discussions already occur in other software communities, so reproducing them within the bioinformatics community has little utility. Also, algorithm innovation clearly deserves plenty of attention, because it has made possible this boom era of high throughput biology.

However, I think more is going on here. Other scientific software communities show the same lack of interest in discussing software methods. From personal experience I know how helpful even simple self-reflection can be for a software project, bioinformatics or otherwise, to say nothing of applying well-known, common-sense software development practices. The day-to-day work of building software involves many of the same elements, regardless of the application domain, and how you approach the work has a tremendous effect on the quality of product and on the team’s overall sense of satisfaction. Surely then, any team that has taken on the development of a sufficiently complex bioinformatics tool must appreciate the importance of software methods and processes. And yet, no one seems to be interested in talking about them.

I’m currently working on writing an editorial that explores my thoughts in greater detail, but semi-briefly, my guesses are these. First, there is a disincentive, especially in the academic community, for bioinformatics scientists to learn more about project management and software methods. Scientists mainly earn their reputations by presenting novel and important results at conferences and in journals, and time spent learning other skills detracts from this prime directive.

Second, many scientists simply don’t find project management very interesting. Their interest is chiefly in making new things possible, at least in theory, through innovation. Once they’ve shown that something can be done, they move on to find the next thing. How to actually manage a team to turn these innovations into production quality applications is not an “interesting question”, to use the cliché. Innovative research is certainly deeply compelling, and the kind of thing most scientists signed up for when they went to graduate school. Actually managing the day-to-day activities of the software development lifecycle can seem uninteresting, even trivial, to an outsider.

Third, some scientists believe that project management really is trivial. The traditional approach that many researchers resort to when they need a piece of scientific software is to get a graduate student to write the program. If it’s a more complex piece of software, get two graduate students to do it. If it’s a really really big project, then maybe they add a postdoc. This approach works well for some projects, but fails miserably for others. Usually, failure is blamed on the people involved (where, sometimes, some part of the blame fairly rests); the approach itself, however, does not usually receive much examination. I would argue that a wildly unsuitable approach is a fairly good guarantee of failure. This oversight on the part of PIs is partly a result of ignorance of software development issues and partly due to an assumption that mastery of their own discipline extends to mastery of others, when, in fact, it does not (something we are all guilty of at one time or another).

The thing is, thus far, bioinformatics has been driven by innovation, because people have concentrated on developing the algorithms that make high throughput biology possible. However, I believe we are moving into an age in which it is just as important to integrate existing algorithms into production quality applications that can serve larger groups of biologists for years on end. Getting there will require project management know-how that thus far has been largely ignored.

Finally, while it is true that the software development issues that face bioinformatics have much in common with the issues that face other types of software development, we cannot leave all discussion of bioinformatics software development issues to traditional software forums. There are things about bioinformatics software development that are unique, and we ought to provide space within the communal discourse to think them through.

But I’m not just going to talk about the problem. I’m looking for others interested to join me in the establishment of a conference on biomedical software development, either standalone or as part of another meeting. There would be tutorials on good software practices, papers presented giving project case studies, workshops on scientific software patterns and anti-patterns, keynotes from people from bioinformatics and the traditional software industry, etc. I think I could possibly get some good people from the traditional software industry interested, but I’ll need an interested group of bioinformatics folks to make this work. Please contact me at mmhohman@northwestern.edu if you’re interested in getting involved.