BDD: specifying domain objects de novo

May 6th, 2007

The vanilla example used in most blog posts for BDD is some incarnation of de novo domain object specification, that is, specifying the behavior of a simple domain object from scratch. David Chelimsky’s stack example is a decent online example of this sort of situation (the comments are interesting to read as well). Stack is an independent class without any collaborators, and its behavior is not extensive. This results in a state-oriented set of contexts with specifications that are very readable and give you a good idea of what Stacks do.

Recently my colleague Kurt Schrader posted something in reaction to a discussion he and I had about a specification I had written. In his post he gives a simple example of de novo domain object specification, and asks (here I paraphrase) whether the example should be the method or the object in a particular state, i.e. should we describe the behavior of an object in a particular state (in his example, “A new sword”) or describe the behavior of the method Sword#sharp?. I would agree with him that in this case describing new and old swords is better than describing sharp?. However, the original specification that provoked the most recent iteration of this discussion was a bit more complex, so I’ll present it (correction: I’ll present a similar specification, see Note at the bottom) here.


Behavior-driven development

May 6th, 2007

About two months ago at CDD we decided to start using the RSpec Behavior-driven development (BDD) framework instead of the standard Test::Unit unit-testing library. My initial interest in using RSpec was that it provided “contexts” for a bundle of tests/specifications (hereafter, “specs”), and that seemed a cleaner way to group specifications/tests than throwing everything in one big test class. Our existing Test::Unit test classes were getting very long (some with 60+ test methods if I remember correctly), and related tests were grouped just by placing them next to each other in the file, which wasn’t always maintainable/maintained. And of course, when you have sixty tests in one class, the setup method has to be too general to be used properly. So we needed to do something, and RSpec seemed like it would help. In addition, I liked how specs in RSpec read better than how a Test::Unit assertion reads, i.e. I liked assigns[:assay].should have(4).runs more than assert_equal(4, assigns(:assay).runs.size) and the like.

I have to admit at the time I only vaguely knew what BDD was supposed to be about. The main thing I knew was that BDD was an attempt to change the words we use to talk about automated developer testing/specification/test-driven development (TDD), to make clearer an under-appreciated purpose of such activity, to help developers write code intentionally using better design (loose coupling, etc.). As such, BDD is less a change of practice from TDD (if TDD is practiced correctly) than a clarification of the practice.

After two months using a BDD framework, I have found that while BDD does clarify high-level principles, it still leaves plenty unclear. I have been unit testing for many years now, and I’ve always felt that automated developer testing is a rich subject that takes a long time to fully appreciate, and is not a discipline that can be covered adequately by a few high-level principles. The details matter, because in most realistic testing situations there are always tradeoffs and context-specific considerations that should lead a developer to take one approach over another. That said, BDD is a significant step in the right direction. Over the next week or so I plan to write a series of blog posts examining some of these detailed contexts we’ve encountered at CDD in the context of the principles and tradeoff considerations, with the hope both that these details will be useful to others and that some more experienced BDDers out there will give us some feedback to help us make better choices about how we specify our code.

Before ending this post, I’ll list some of the high-level principles, so I can refer to them over the next week:

  1. Specs should be valuable.
  2. Specs should be acceptable.
  3. Corollary of #2: Some code duplication in specs is ok; the focus should be on clarity/readability/acceptability.
  4. Specs should specify behavior not implementation (the classic interface vs. implementation distinction). Unfortunately, we’ve discovered that “behavior” is still a fairly vague term (leading to some intense discussions within our team), and what “the interface” is varies according to context.
  5. Contexts/examples should set up a particular state (of an object, etc.), and specifications should then describe the behavior in that state. This is typically accomplished by setting up state in the setup method or before(:each) block, and then writing many short descriptions of behavior in separate test methods/specs.
  6. Specs should be loosely coupled to application code, so that refactoring app code doesn’t cause lots of tests to break. There is at least a hope here that by specifying behavior/interfaces you’re likely to get loose coupling as well.
  7. Specs should encourage developers to think about interface-centric, just-in-time design of their code. This is TDD/BDD’s major benefit #1.
  8. Finally, I still believe (and here I perhaps depart from some BDDers) that the other major benefit of TDD/BDD is that specs help you verify that your application code works. This is particularly true for small development teams that don’t have a ruthless army of QA people keeping a lid on bugs.
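
To make principle 5 concrete, here is a minimal sketch of state-oriented contexts. It is not one of our real specs: the Stack class is a hypothetical stand-in, and minitest/spec (which normally ships with Ruby) stands in for RSpec so that the block runs without extra gems. Its before block plays the role of before(:each).

```ruby
require "minitest/autorun"

# Hypothetical domain object used only for illustration.
class Stack
  class EmptyError < StandardError; end

  def initialize
    @items = []
  end

  # Returns self so that setup code can chain pushes.
  def push(item)
    @items.push(item)
    self
  end

  def pop
    raise EmptyError, "stack is empty" if @items.empty?
    @items.pop
  end

  def empty?
    @items.empty?
  end
end

# Each context sets up one state; each example describes one behavior in it.
describe "An empty stack" do
  before do
    @stack = Stack.new
  end

  it "is empty" do
    assert_predicate @stack, :empty?
  end

  it "complains when popped" do
    assert_raises(Stack::EmptyError) { @stack.pop }
  end
end

describe "A stack with one item" do
  before do
    @stack = Stack.new.push(:apple)
  end

  it "is not empty" do
    refute_predicate @stack, :empty?
  end

  it "returns the item when popped" do
    assert_equal :apple, @stack.pop
  end

  it "is empty after being popped" do
    @stack.pop
    assert_predicate @stack, :empty?
  end
end
```

Because each context fixes a single state, the setup stays small and specific, which is exactly what a sixty-test class with one shared setup method cannot offer.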

Stay tuned for a specific example later today.

A better LSID, part 1

April 30th, 2007

Those of you familiar with bioinformatics might know about the Life Sciences Identifier (LSID) specification, which describes a URN-based identifier (think “primary key”) for life sciences data objects (genes, proteins, microarray experiments, radiology images, clinical trial study calendars, etc.). I learned a lot about these over my last year and a half at Northwestern during my work with caBIG, because I was heavily involved with developing the first draft proposal for the use of data object identifiers within the caBIG grid. The intended benefits of LSIDs primarily include the following:

  1. location independence — an LSID identifies a resource, not a particular data record on a particular server on the network
  2. global uniqueness — the same LSID should never be used to refer to two distinct objects
  3. local assignment — there must be some easy mechanism for data object creators to assign globally unique LSIDs to their data without the need for onerous bureaucracy
  4. permanence — once assigned, an LSID cannot be reassigned to a different object
  5. semantic opacity — data clients who use LSIDs to refer to data are not supposed to read into the substructure of the LSID and make any conclusions about what the LSID means, other than being a key for a particular data object, e.g. urn:lsid:frank:dog:38922 should not be assumed to have anything to do with Frank or his dog.
  6. data and metadata — in addition to the bytes representing the data object identified, there is a separate “channel” of information, the metadata, which can contain information about relationships between the given data object and other data objects.

and some other stuff; see the official spec. LSIDs look like this: urn:lsid:ncbi.nlm.nih.gov:pubmed:9486653. One of the key steps in employing LSIDs on a grid (or in any network accessible way) is the deployment of some kind of resolution service, a server or set of servers that delivers the data for a given LSID to a requesting data client. Also, as a corollary of the “global uniqueness” requirement above, it is important that requesting resolution of a given LSID should always return the same set of bytes, regardless of when the resolution occurs. That is, if you assign an LSID to an object, you assign it to a specific set of bytes representing that object, and that association can never change (although the data provider might stop providing the data). This allows you to cache these objects without worrying about cache invalidation (since the bytes will never change) and more importantly enables reproducible research: a computational researcher can publish results based on a dataset identified by a bunch of LSIDs, and another researcher later can run the same computations on those same LSIDs and expect to arrive at exactly the same results.

Importantly, the metadata referred to by an LSID is not required to be byte-identical, so it has a bit more flexibility to cover some important use cases, as we will see in my next post on the subject.
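
As a rough sketch of the two ideas above (the layout of an LSID and caching without invalidation), here is some Ruby. Everything in it is an assumption made for illustration: Lsid, parse_lsid and CachingResolver are hypothetical names, not part of any LSID library, and the fetch block stands in for a real resolution service. Only a resolver would split the identifier like this, since it needs the authority to know where to ask; ordinary data clients should still treat the whole string as opaque.

```ruby
# Hypothetical value object holding the parts of an LSID:
# urn:lsid:<authority>:<namespace>:<object>[:<revision>]
Lsid = Struct.new(:authority, :namespace, :object, :revision)

LSID_PATTERN = /\Aurn:lsid:([^:]+):([^:]+):([^:]+)(?::([^:]+))?\z/

# Splits an LSID string into its parts; revision is nil when absent.
def parse_lsid(text)
  match = LSID_PATTERN.match(text)
  raise ArgumentError, "not an LSID: #{text.inspect}" unless match
  Lsid.new(*match.captures)
end

# Hypothetical resolver wrapper. Because the bytes behind an LSID never
# change, a cached entry never needs to be invalidated.
class CachingResolver
  # The block receives an LSID string and returns its bytes
  # (in real life, by calling the resolution service).
  def initialize(&fetch)
    @fetch = fetch
    @cache = {}
  end

  def resolve(lsid)
    @cache.fetch(lsid) { @cache[lsid] = @fetch.call(lsid) }
  end
end

parsed = parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:9486653")
puts parsed.authority  # the part a resolver would use to find the data provider

resolver = CachingResolver.new { |lsid| "bytes for #{lsid}" }
resolver.resolve("urn:lsid:ncbi.nlm.nih.gov:pubmed:9486653")  # fetched
resolver.resolve("urn:lsid:ncbi.nlm.nih.gov:pubmed:9486653")  # served from the cache
```

Note that the cache is keyed by the complete LSID string, so nothing in it depends on what the substructure might appear to mean.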

Slow Rails migrations, Ruby GC, and a MacPorts portfile

January 5th, 2007

At my new job we recently had to use a Rails migration to convert millions of rows of data. Unfortunately the conversion could not be done with SQL; we had to load each row and use Ruby to massage the columns. When we started testing the data conversion on a replica of the real database and measured how long the migration was going to take to complete, we realized it would take almost a week. Looking deeper, the problem proved to be the Ruby garbage collector, which, according to posts I’ve read elsewhere, is optimized for short-running scripts and works hard to try to keep the Ruby interpreter’s memory footprint small by running the garbage collector very often when there are lots of objects in memory. By my own measurements it was running after converting approximately every tenth table row, i.e. hundreds of thousands of times during our migration.
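
For anyone who wants to make a similar measurement, here is a minimal sketch of counting garbage collector runs around a block of work. It is not the code I used: it assumes a current Ruby, where GC.count is available, and gc_runs_during, convert_row and the in-memory rows are hypothetical stand-ins for a real migration. The number it prints will vary from machine to machine and between Ruby versions.

```ruby
# Returns how many times the garbage collector ran while the block executed.
def gc_runs_during
  runs_before = GC.count
  yield
  GC.count - runs_before
end

# Hypothetical per-row conversion that allocates a few short-lived objects,
# as massaging columns in Ruby tends to do.
def convert_row(row)
  { id: row[:id], name: row[:name].strip.upcase }
end

# Stand-in for table rows loaded from the database.
rows = Array.new(50_000) { |i| { id: i, name: "  compound #{i}  " } }

runs = gc_runs_during { rows.each { |row| convert_row(row) } }
puts "GC ran #{runs} times while converting #{rows.size} rows"
```

Dividing the number of rows by the number of runs gives a rough "rows per GC run" figure, which is how a number like "every tenth row" can be arrived at.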

Stefan Kaes, the author of railsbench and the RailsExpress blog, has a patch for Ruby that affords you more control over the garbage collector via a few environment variables (essentially allowing Ruby to consume more memory on your machine in exchange for running the GC less often). I didn’t want to apply this patch to my machine (MacBook Pro) without being able to uninstall it, so I wrote a MacPorts portfile based on the one for Ruby 1.8.5. It installs Ruby 1.8.5-p2 and applies Stefan’s GC patch. Since 1.8.5-p2 includes the CGI denial of service fix that is applied as a patch in the 1.8.5 portfile, I removed that patch (ruby-1.8.5-cgi-dos-1.patch). I installed the portfile, and this turned a week of running time into a little less than a day. So, thanks, Stefan. If you’d like to try out the portfile yourself, you can download it here. The gzipped tarfile includes the portfile and the required patch files distributed with the original portfile, plus the railsbench patch, renamed (to patch-gc.c) so that MacPorts can use it properly. If you try it out, please let me know how it works for you.

Note: Just as I was about to publish this, I noticed that about two hours ago MacPorts released an official 1.8.5-p12 portfile (what happened to p3 through p11?); of course, it does not contain the GC patch. I’ve updated my Portfile based on this new one. Also, instead of installing the GC patch by default, I’ve included it as a variant called railsbench. So, once you’ve unpacked the tar into your local port repository, added that repository to /opt/local/etc/ports/sources.conf, and “portindex”ed the local repository, all you have to do is sudo port install ruby @1.8.5-p12 +railsbench, that last part being the variant. See the INSTALL file in the tarball for more detailed instructions.

Long time no blog

October 3rd, 2006

I’ve been pretty busy over the last several months, which has left little time for blogging. Since June I have moved to San Francisco with my fiancee, found an apartment (no small affair) and started a new job with a small cheminformatics startup as director of software development. So far so good. Probably more about that later.

The two posts below I wrote a long time ago but for some reason never published.