Dependency Injection != Service Locator

November 24th, 2004

Part of any Dependency Injection (DI) solution is some sort of registry that contains all of the configured objects with their declared dependencies resolved. In Spring this is the ApplicationContext, and in PicoContainer it’s, well, the PicoContainer.

An alternative application configuration pattern is Service Locator, itself a kind of registry with which any object can look up its dependencies. Because both these patterns have a registry object, it is possible to use the Dependency Injection container as a Service Locator, making all objects dependent on the container class (let’s stick with Spring because I know it better), the ApplicationContext. If you’re going to do Dependency Injection, however, it would be better to have objects coupled only to their precise dependencies, rather than passing an instance of the ApplicationContext to an object.

However, there are more arguments against this than just purity of approach. We recently ran into this tendency on Neuromice, which uses Spring for DI. We also use Quartz for job scheduling, and we had been passing an instance of the ApplicationContext to each job through its JobExecutionContext. The problem with this was that, during unit testing, we needed to pass a valid testing version of our entire ApplicationContext to the job. Theoretically, we could have built a special ApplicationContext that contained only those dependencies needed by the job. However, in Spring at least, that usually means lots of little XML configuration files, so it was easier to have just two context options, the real one and the testing one (which contains all of the data access objects working off of HSQLDB instead of Oracle). Because our application has become fairly complex behind the scenes, creating even the testing version of this ApplicationContext takes ten seconds or so, slowing down our unit tests.

It is much faster to create a few mock objects and pass them directly to the job during unit testing. More compelling, this also allows simpler configuration of the interaction of the job with these dependencies. Most compelling, at least to me, passing precise mock objects also does a better job of isolating what’s being tested to just the code in the job.

Thankfully, we realized that Spring had recently added a QuartzJobBean superclass, which, if you inherit from it, will set bean property dependencies on your job from attributes in the JobExecutionContext or the SchedulerContext. This means that during unit testing, I can pass a null JobExecutionContext to the job, first fulfilling all the dependencies with mock objects via setters. When the application runs normally, these dependencies are resolved using the beans configured in the ApplicationContext. Very nice.
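To make the testing benefit concrete, here is a minimal sketch in plain Java. It deliberately does not use the real Spring or Quartz classes: SampleDao, Notifier, PendingSampleReportJob and RecordingNotifier are all hypothetical names invented for illustration, and the execute method only stands in for the scheduler callback. The point is just that a job with setter-injected dependencies can be exercised with a couple of hand-rolled mocks and a null context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical dependency of the job; stands in for a real data access object.
interface SampleDao {
    List<String> findPendingSampleIds();
}

// Hypothetical second dependency.
interface Notifier {
    void send(String message);
}

// The job is coupled only to its precise dependencies, supplied through setters,
// and never sees the container.
class PendingSampleReportJob {
    private SampleDao sampleDao;
    private Notifier notifier;

    public void setSampleDao(SampleDao sampleDao) { this.sampleDao = sampleDao; }
    public void setNotifier(Notifier notifier) { this.notifier = notifier; }

    // Stand-in for the scheduler callback; the context argument is never read.
    public void execute(Object jobExecutionContext) {
        List<String> ids = sampleDao.findPendingSampleIds();
        notifier.send(ids.size() + " samples pending");
    }
}

// Hand-rolled mock that records what the job sent.
class RecordingNotifier implements Notifier {
    final List<String> messages = new ArrayList<>();

    @Override
    public void send(String message) { messages.add(message); }
}

class JobInjectionDemo {
    public static void main(String[] args) {
        PendingSampleReportJob job = new PendingSampleReportJob();
        RecordingNotifier notifier = new RecordingNotifier();
        job.setSampleDao(() -> Arrays.asList("s1", "s2")); // mock DAO, no database
        job.setNotifier(notifier);
        job.execute(null);                                 // no container, no context
        System.out.println(notifier.messages);             // prints [2 samples pending]
    }
}
```

Nothing here takes ten seconds to build, and the only code under test is the body of the job itself.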

As a postscript, I recently noticed that someone else beat me to this observation.

Bioinformatics Project Management

November 14th, 2004

So far bioinformaticians have, publicly at least, focused mainly on science. Any given conference or journal is full of papers about algorithms and newly available software tools. Conspicuously absent, at least to me, is any discussion of the software development process used to turn those algorithms into those working software tools. One might argue that such discussions already occur in other software communities, so reproducing them within the bioinformatics community has little utility. Also, algorithm innovation clearly deserves plenty of attention, because it has made possible this boom era of high throughput biology.

However, I think more is going on here. Other scientific software communities show the same lack of interest in discussing software methods. From personal experience I know how helpful even simple self-reflection can be for a software project, bioinformatics or otherwise, to say nothing of applying well-known, common-sense software development practices. The day-to-day work of building software involves many of the same elements, regardless of the application domain, and how you approach the work has a tremendous effect on the quality of the product and on the team’s overall sense of satisfaction. Surely then, any team that has taken on the development of a sufficiently complex bioinformatics tool must appreciate the importance of software methods and processes. And yet, no one seems to be interested in talking about them.

I’m currently working on writing an editorial that explores my thoughts in greater detail, but semi-briefly, my guesses are these. First, there is a disincentive, especially in the academic community, for bioinformatics scientists to learn more about project management and software methods. Scientists mainly earn their reputations by presenting novel and important results at conferences and in journals, and time spent learning other skills detracts from this prime directive.

Second, many scientists simply don’t find project management very interesting. Their interest is chiefly in making new things possible, at least in theory, through innovation. Once they’ve shown that something can be done, they move on to find the next thing. How to actually manage a team to turn these innovations into production quality applications is not an “interesting question”, to use the cliché. Innovative research is certainly deeply compelling, and the kind of thing most scientists signed up for when they went to graduate school. Actually managing the day-to-day activities of the software development lifecycle can seem uninteresting, even trivial, to an outsider.

Third, some scientists believe that project management really is trivial. The traditional approach that many researchers resort to when they need a piece of scientific software is to get a graduate student to write the program. If it’s a more complex piece of software, get two graduate students to do it. If it’s a really really big project, then maybe they add a postdoc. This approach works well for some projects, but fails miserably for others. Usually, failure is blamed on the people involved (where, sometimes, some part of the blame fairly rests); the approach itself, however, does not usually receive much examination. I would argue that a wildly unsuitable approach is a fairly good guarantee of failure. This oversight on the part of PIs is partly a result of ignorance of software development issues and partly due to an assumption that mastery of their own discipline extends to mastery of others, when, in fact, it does not (something we are all guilty of at one time or another).

The thing is, thus far, bioinformatics has been driven by innovation, because people have concentrated on developing the algorithms that make high throughput biology possible. However, I believe we are moving into an age in which it is just as important to integrate existing algorithms into production quality applications that can serve larger groups of biologists for years on end. Getting there will require project management know-how that thus far has been largely ignored.

Finally, while it is true that the software development issues that face bioinformatics have much in common with the issues that face other types of software development, we cannot leave all discussion of bioinformatics software development issues to traditional software forums. There are things about bioinformatics software development that are unique, and we ought to provide space within the communal discourse to think them through.

But I’m not just going to talk about the problem. I’m looking for others who are interested in joining me to establish a conference on biomedical software development, either standalone or as part of another meeting. There would be tutorials on good software practices, papers presenting project case studies, workshops on scientific software patterns and anti-patterns, keynotes from people in bioinformatics and the traditional software industry, etc. I think I could possibly get some good people from the traditional software industry interested, but I’ll need an interested group of bioinformatics folks to make this work. Please contact me at mmhohman@northwestern.edu if you’re interested in getting involved.

Directory Services: DNS and LDAP

November 14th, 2004

While attending the caBIG Architecture Workspace face-to-face in Chicago last month, I realized during a discussion that I probably didn’t know enough about directory services. One of the major challenges faced in designing the caBIG architecture is the matter of serving up Common Data Elements, or CDEs. These are standardized vocabulary terms used to compose a structured informational representation of a clinical outcome, expression level measurements, etc. Because they will be used throughout caBIG, some service must be provided to allow components of caBIG to look up these vocabulary terms, e.g. to fill in the details of a particular dataset. For even a moderately mature grid to function properly, this vocabulary service must be distributed and highly available, so that CDE lookups do not become a grid bottleneck.

At the face-to-face, Frank Hartel asked the room what sorts of servers and server architectures might be needed to fill such a requirement. I chimed in, naively, that we might learn something from the DNS, because of its ubiquity and high availability. Frank responded that he/they were thinking more of something like LDAP. Now, I’d heard of LDAP and knew it had something to do with looking up organizational directory information (names, email addresses, phone numbers), but had no real hold on the scope of the LDAP standard and the problems it was capable of addressing.

Since then, I’ve tried to do my homework: I have read about both the DNS and LDAP directory services and have thought about how they might be applied to serve vocabularies to the caBIG grid. The following are some preliminary thoughts.

Both DNS and LDAP serve information organized in a tree of nodes, each node having a particular class and containing some number of attributes. In DNS, this tree is called the Domain Name Space, and in LDAP it is called the Directory Information Tree (DIT). Each node of the tree also has a unique name. In DNS this name is the fully-qualified domain name (FQDN), and in LDAP this is the distinguished name (DN). Highly distinguished indeed. In DNS there are not that many node classes, by far the most common being the IN, or Internet, class. By contrast, there are a very large number of LDAP node objectclasses. Each LDAP objectclass must be assigned a globally unique object identifier (OID), which looks like 1.1.3.232.23.242.1.52, i.e. a string of integers separated by periods. The OIDs fit into their own tree according to a hierarchical structure imposed on the organizations to which they are assigned (or something like that).
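As a small illustration of how both naming schemes fall out of a node's position in the tree, here is a sketch in Java. The class TreeNames and its two methods are hypothetical helpers of my own, not part of any DNS or LDAP library. Both names are written most-specific-first: an FQDN joins the labels from leaf to root with dots (the trailing dot standing for the root), and a DN joins the relative names from leaf to root with commas. Escaping of special characters is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds a node's full name from its path in the tree.
class TreeNames {
    // Path is given root-first, e.g. ["edu", "northwestern", "www"].
    static String fqdn(List<String> rootFirstLabels) {
        List<String> leafFirst = new ArrayList<>(rootFirstLabels);
        Collections.reverse(leafFirst);
        // The trailing dot marks the root of the Domain Name Space.
        return String.join(".", leafFirst) + ".";
    }

    // Path is given root-first as relative names, e.g. ["dc=edu", "dc=northwestern", "cn=www"].
    static String dn(List<String> rootFirstRelativeNames) {
        List<String> leafFirst = new ArrayList<>(rootFirstRelativeNames);
        Collections.reverse(leafFirst);
        return String.join(",", leafFirst);
    }
}

class TreeNamesDemo {
    public static void main(String[] args) {
        // prints www.northwestern.edu.
        System.out.println(TreeNames.fqdn(Arrays.asList("edu", "northwestern", "www")));
        // prints cn=www,dc=northwestern,dc=edu
        System.out.println(TreeNames.dn(Arrays.asList("dc=edu", "dc=northwestern", "cn=www")));
    }
}
```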

The attributes of a node in DNS are called resource records (RRs). The most well-known type of resource record is the A, or address, record, which for IN nodes stores an IP address associated with the FQDN of the node. Common LDAP attributes are, for example, dc (domain component), dn (distinguished name, which, as an attribute, is a relative name, or the most specific part of the full DN of the node), o (organization), cn (common name), and sn (surname).

In DNS, a node’s class does not determine which attributes it may contain. Any node can potentially have any attribute. The class refers to the type of network about which information is provided (IN = internet, CH = chaos). Therefore, extending DNS is a matter of adding new attributes, or types of resource records. By contrast, in LDAP, a node’s objectclass determines the types of attributes that node must and may contain. Extending LDAP is more flexible than extending DNS, and would require writing new objectclass and attribute definitions in an LDAP schema.
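The “must and may” rule can be sketched as a tiny validator. This is my own simplification, not how any LDAP server is implemented: ObjectClassDef is a hypothetical class, it handles a single objectclass with no inheritance, and it compares attribute names case-sensitively. The example definition mirrors the standard person objectclass, which requires sn and cn and allows a handful of optional attributes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of an objectclass definition.
class ObjectClassDef {
    final String name;
    final Set<String> must; // attributes every entry of this class must have
    final Set<String> may;  // attributes an entry of this class is allowed to have

    ObjectClassDef(String name, Set<String> must, Set<String> may) {
        this.name = name;
        this.must = must;
        this.may = may;
    }

    static Set<String> setOf(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // Returns a sorted set of problems; an empty set means the entry conforms.
    Set<String> validate(Set<String> attributeNames) {
        Set<String> problems = new TreeSet<>();
        for (String required : must) {
            if (!attributeNames.contains(required)) {
                problems.add("missing required attribute: " + required);
            }
        }
        for (String present : attributeNames) {
            if (!must.contains(present) && !may.contains(present)) {
                problems.add("attribute not allowed: " + present);
            }
        }
        return problems;
    }
}

class ObjectClassDemo {
    public static void main(String[] args) {
        ObjectClassDef person = new ObjectClassDef(
                "person",
                ObjectClassDef.setOf("sn", "cn"),
                ObjectClassDef.setOf("userPassword", "telephoneNumber", "seeAlso", "description"));

        // Conforms: both required attributes plus one optional one.
        System.out.println(person.validate(ObjectClassDef.setOf("cn", "sn", "telephoneNumber")));
        // Does not conform: sn is missing and mail is not allowed by this class.
        System.out.println(person.validate(ObjectClassDef.setOf("cn", "mail")));
    }
}
```

A DNS node has no equivalent check, since any resource record type can be attached to any name.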

Both LDAP and DNS provide for information about different parts of the tree being served by different servers on the network. The directory tree is split into “zones” (the DNS word) or “partitions” (the LDAP word), which are subgraphs of elements that share a common ancestor plus the common ancestor (the “root”). A particular server can refer requests that it receives for parts of the directory tree outside of its zone/partition to other servers. Servers can also cache results of requests referred to other servers, as well as cache results of requests performed on their local directory. Finally, a given server can be made redundant for load-balancing and failover purposes. Synchronization between the master server and its replicas occurs via a pull model in DNS (replicas periodically poll the master for updates), whereas in LDAP either a push (the master notifies the replicas of updates) or a pull model can be used.
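The referral-plus-cache idea can be sketched as a toy model. Everything here is an assumption made for illustration: DirectoryServer is a hypothetical class, the example.org names are made up, and the server follows the referral itself rather than handing it back to the client, which is only one of the ways real DNS and LDAP deployments behave. Cache expiry and replication are left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a directory server that owns one zone/partition of the tree.
class DirectoryServer {
    final String zoneSuffix;                                        // root of the subtree this server owns
    final Map<String, String> local = new HashMap<>();              // entries inside its own zone
    final Map<String, DirectoryServer> referrals = new HashMap<>(); // zone suffix -> server that owns it
    final Map<String, String> cache = new HashMap<>();              // answers obtained from other servers
    int referredLookups = 0;                                        // how often another server was asked

    DirectoryServer(String zoneSuffix) {
        this.zoneSuffix = zoneSuffix;
    }

    static boolean inZone(String name, String suffix) {
        return name.equals(suffix) || name.endsWith("." + suffix);
    }

    // Returns the value stored for the name, or null if no known server has it.
    String lookup(String name) {
        if (inZone(name, zoneSuffix)) {
            return local.get(name);            // answered from this server's own zone
        }
        String cached = cache.get(name);
        if (cached != null) {
            return cached;                     // answered from cache, no network hop
        }
        for (Map.Entry<String, DirectoryServer> referral : referrals.entrySet()) {
            if (inZone(name, referral.getKey())) {
                referredLookups++;
                String answer = referral.getValue().lookup(name);
                if (answer != null) {
                    cache.put(name, answer);   // remember the answer for next time
                }
                return answer;
            }
        }
        return null;
    }
}

class DirectoryServerDemo {
    public static void main(String[] args) {
        DirectoryServer vocabulary = new DirectoryServer("cde.example.org");
        vocabulary.local.put("age.cde.example.org", "Age at diagnosis");

        DirectoryServer lab = new DirectoryServer("lab.example.org");
        lab.referrals.put("cde.example.org", vocabulary);

        System.out.println(lab.lookup("age.cde.example.org")); // first lookup is referred
        System.out.println(lab.lookup("age.cde.example.org")); // second lookup comes from cache
        System.out.println(lab.referredLookups);               // prints 1
    }
}
```

For something like CDE lookups, the interesting property is the last line: repeated requests for the same term stop reaching the server that owns it.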

On the face of things, the two technologies seem rather equivalent. However, DNS is decidedly the more mature technology. It serves millions of users every day, there are thousands of DNS servers deployed on the network, and every computer has a command-line tool to debug DNS queries (dig or nslookup). LDAP is a standard for organization-wide directory services, but it is not in great use cross-organizationally (as far as I know). DNS was designed to provide a fairly narrow range of directory services; by contrast, LDAP is much more general.

Thus far, the two tools have largely been applied to very different problems, so it’s difficult to compare how the two might fare at the same task.