Slow Rails migrations, Ruby GC, and a MacPorts portfile

January 5th, 2007

At my new job we recently had to use a Rails migration to convert millions of rows of data. Unfortunately the conversion could not be done in SQL; we had to load each row and use Ruby to massage the columns. When we tested the conversion on a replica of the real database and measured how long the migration would take, we realized it would need almost a week to complete. Looking deeper, the problem proved to be the Ruby garbage collector, which, according to posts I’ve read elsewhere, is optimized for short-running scripts and works hard to keep the Ruby interpreter’s memory footprint small by running very often when there are lots of objects in memory. By my own measurements it was running roughly once for every ten table rows converted, i.e. hundreds of thousands of times during our migration.

Stefan Kaes, the author of railsbench and the RailsExpress blog, has a patch for Ruby that affords you more control over the garbage collector via a few environment variables (essentially allowing Ruby to consume more memory on your machine in exchange for running the GC less often). I didn’t want to apply this patch on my machine (a MacBook Pro) without being able to uninstall it, so I wrote a MacPorts portfile based on the one for Ruby 1.8.5. It installs Ruby 1.8.5-p2 and applies Stefan’s GC patch. Since 1.8.5-p2 includes the CGI denial-of-service fix that the 1.8.5 portfile applies as a patch, I removed that patch (ruby-1.8.5-cgi-dos-1.patch). I installed the port, and this turned a week of running time into a little less than a day. So, thanks, Stefan. If you’d like to try out the portfile yourself, you can download it here. The gzipped tarfile includes the portfile, the required patch files distributed with the original portfile, and the railsbench patch, renamed to patch-gc.c so that MacPorts can use it properly. If you try it out, please let me know how it works for you.

Note: Just as I was about to publish this, I noticed that about two hours ago MacPorts released an official 1.8.5-p12 portfile (what happened to p3 through p11?). It does not, of course, contain the GC patch. I’ve updated my Portfile based on this new one. Also, instead of applying the GC patch by default, I’ve included it as a variant called railsbench. So, once you’ve unpacked the tarball into your local port repository, added that repository to /opt/local/etc/ports/sources.conf, and run portindex on the local repository, all you have to do is run sudo port install ruby @1.8.5-p12 +railsbench, the last part being the variant. See the INSTALL file in the tarball for more detailed instructions.

Agile bioinformatics paper

June 18th, 2006

I don’t know why I didn’t post this before, but at the end of last month BMC Bioinformatics posted a provisional version of our paper on agile software methods in bioinformatics. The good news is people seem to be reading it. It is the journal’s #3 most viewed paper in the last thirty days! Give it a read and let me know what you think.

schemamule 1.0 released

May 23rd, 2006

We finally got around to open sourcing schemamule, a utility we wrote in the bioinformatics core at Northwestern. Schemamule copies the schema from one database to another (right now we support copying from Oracle to HSQLDB). We have used this tool for over two years now to set up an in-memory HSQLDB database to use for unit testing. This has at least a couple of nice benefits:

  • Database-dependent tests (e.g. DAO tests) run much faster than when they are run against a database accessible only over the network, especially when working from home
  • Database-dependent unit tests can be run even if you aren’t on the network at all, very nice if you’re working on a plane, in an airport, or at the beach

We use schemamule primarily via its schemacopy Ant task, which has configuration options for copying just a subset of the table definitions from the schema, for copying views and sequences, etc. Note that table records are never copied.

One of our apps does a number of funky custom-SQL queries for reporting (we have not yet taken the plunge and created a reporting data warehouse DB), and this meant that we had to fake a number of Oracle-specific features in HSQLDB. For example, in HSQLDB there is no DUAL table. We fake it by creating a table called DUAL that has one (and only one) row (the column names and record content being irrelevant). We also provide an HsqldbLibrary class to define other functions (e.g. TRUNC) not available in HSQLDB.
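
To give a flavor of what such a library function can look like, here is a minimal sketch of a TRUNC stand-in. It assumes the HSQLDB 1.8 mechanism of exposing a public static Java method as a SQL function with CREATE ALIAS. The class name HsqldbLibrarySketch and the method body are illustrative only, not the actual schemamule code, and the sketch covers just the single-argument date form of Oracle's TRUNC.

```java
import java.sql.Timestamp;
import java.util.Calendar;

// Hypothetical stand-in for schemamule's HsqldbLibrary; not the real class.
// For real use, declare it public in its own file so HSQLDB can call it.
class HsqldbLibrarySketch {

  // Mimics Oracle's single-argument TRUNC(date): drops the time of day.
  // Assumed registration in HSQLDB 1.8 (adjust the package name):
  //   CREATE ALIAS TRUNC FOR "your.package.HsqldbLibrarySketch.trunc"
  public static Timestamp trunc(Timestamp value) {
    if (value == null) {
      return null; // TRUNC of NULL stays NULL
    }
    Calendar cal = Calendar.getInstance(); // default time zone
    cal.setTimeInMillis(value.getTime());
    cal.set(Calendar.HOUR_OF_DAY, 0);
    cal.set(Calendar.MINUTE, 0);
    cal.set(Calendar.SECOND, 0);
    cal.set(Calendar.MILLISECOND, 0);
    return new Timestamp(cal.getTimeInMillis());
  }
}
```

The DUAL table needs no Java at all: a one-column table with a single inserted row is enough.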

Anyway, if you do a lot of automated unit testing in Java, use Ant to automate your builds, and have slow database-dependent tests, try out schemamule, see if it helps, and let us know. Usage examples are available on the project website.

Object validation

April 21st, 2006

A while back I was talking with a colleague about domain object validation for an application built using the Spring framework. My bioinformatics team at the CFG has always put the validation logic into the business object itself. In the case of command objects (the objects that hold form state), this means we create a custom interface, Validatable, that declares a single method, void validate(Errors errors) (the Errors interface is part of the Spring validation machinery, representing a smart hash of errors by object property). Calling this method records any validation issues into the Errors object passed to validate(). We also need a simple Validator called ValidatableValidator that takes a Validatable and calls validate(errors) on it.
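
A rough sketch of that arrangement follows. So that it compiles on its own, Errors and Validator here are cut-down stand-ins for the Spring interfaces of the same names, and SignupCommand and ListErrors are hypothetical classes invented for the example. Only Validatable and ValidatableValidator correspond to the classes described above, and even those are a reconstruction rather than our actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Spring's Errors and Validator interfaces;
// the real ones declare many more methods.
interface Errors {
  void rejectValue(String field, String errorCode);
  boolean hasErrors();
}

interface Validator {
  boolean supports(Class<?> clazz);
  void validate(Object target, Errors errors);
}

// The custom interface: the object checks its own state.
interface Validatable {
  void validate(Errors errors);
}

// One generic validator that delegates to the object being validated.
class ValidatableValidator implements Validator {
  public boolean supports(Class<?> clazz) {
    return Validatable.class.isAssignableFrom(clazz);
  }

  public void validate(Object target, Errors errors) {
    ((Validatable) target).validate(errors);
  }
}

// Hypothetical command object holding form state.
class SignupCommand implements Validatable {
  private final String username;

  SignupCommand(String username) {
    this.username = username;
  }

  public void validate(Errors errors) {
    if (username == null || username.trim().isEmpty()) {
      errors.rejectValue("username", "required");
    }
  }
}

// Hypothetical in-memory Errors, used only to exercise the sketch.
class ListErrors implements Errors {
  final List<String> rejected = new ArrayList<String>();

  public void rejectValue(String field, String errorCode) {
    rejected.add(field + ":" + errorCode);
  }

  public boolean hasErrors() {
    return !rejected.isEmpty();
  }
}
```

Passing a SignupCommand with a blank username through ValidatableValidator leaves one entry in the ListErrors; a non-blank username leaves it empty.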

This always seemed the right approach to me, because in object-oriented design you should move operations that act on data close to the data themselves. In other words, put the validate method that checks for valid data state in the object that encapsulates that data: tell, don’t ask.

My colleague was of the opposite opinion. He felt that command objects should just be dumb objects (“objects” with only data and accessor methods, no business logic), and all validation logic should be kept in separate validator objects. When I protested that dumb objects were anathema in object-oriented code, he asked me why Spring provided a Validator interface and not a Validatable interface. I had to admit I was a little dumbfounded. I generally think Spring is well-written, but it did seem they were explicitly recommending splitting validation behavior out of the object being validated.

Thinking about it some more, I’ve decided that while I still generally think that validation logic should stay in the object being validated, there are cases when you’d want to split it out.

  • when the validated object’s class already has lots of behavior, is getting too big and needs to be split up into a few collaborating classes, closely coupled to one another but not to other classes
  • when the same validation logic occurs in multiple business objects (e.g. this string is not null, empty or whitespace) and you’d like to simplify maintenance and keep that logic in only one place, though this can just as easily be done using a static method, e.g. StringUtils.isBlank(text)
  • when validation depends on capabilities foreign to the object itself (e.g. checking uniqueness of a username via a data access object (DAO) — it often makes good sense to split the responsibility for data access out of domain objects and into DAOs)

Another example of the last case is anything that has to interact with the network. You don’t want your domain objects depending on a network, especially if you want unit tests to run rapidly and reliably without access to a network (e.g. when you’re sitting in an airport that charges $10/hour for wireless internet access). In the neuromice.org code, when we validate mouse data pulled in from one of the member sites, we check all links provided to make sure they work and refer to the expected content, and we do this using an external validator class. However, we avoid violating the “tell, don’t ask” principle by using something similar to the visitor pattern. The object being validated has an accept(ValidationVisitor validator) method. The validator passed in has methods like validateLink(String url), etc. In our unit tests we pass in either a mock or a stub ValidationVisitor that does not actually hit the network. This allows us to separate out responsibilities nicely without requiring brittle procedural code sitting in an unrelated class like

class SomeFormValidator implements Validator {
  public void validate(Object object, Errors errors) {
    // oddly the cast is still necessary in Spring 2.0 M3
    MutantLine line = (MutantLine) object;
    if (!LinkValidator.isValidLink(line.getGeneLink())) {
      errors.reject("etc. etc. etc.");
    }
    ...
  }
}

Instead the code is something like

class MutantLine { ...
  public void accept(ValidationVisitor validator) {
    validator.validateLink(this.geneLink, "gene link");
    associatedObject.accept(validator);
    ...
  }
}

Note that the ValidationVisitor encapsulates the Errors object.
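
To make the testing side concrete, here is a minimal sketch of a stub visitor of the kind we pass in from unit tests. The names MutantLineSketch and StubValidationVisitor are made up for this example, the prefix check is a placeholder for whatever a real offline check would do, and a plain list of messages stands in for the encapsulated Errors object.

```java
import java.util.ArrayList;
import java.util.List;

// Visitor-style validator; the two-argument validateLink matches the call
// made from MutantLine.accept.
interface ValidationVisitor {
  void validateLink(String url, String label);
}

// Cut-down version of MutantLine with only the field needed here.
class MutantLineSketch {
  private final String geneLink;

  MutantLineSketch(String geneLink) {
    this.geneLink = geneLink;
  }

  public void accept(ValidationVisitor validator) {
    validator.validateLink(this.geneLink, "gene link");
  }
}

// Hypothetical test stub: never touches the network. It only checks that
// the URL starts with "http://" and collects messages in a list.
class StubValidationVisitor implements ValidationVisitor {
  final List<String> messages = new ArrayList<String>();

  public void validateLink(String url, String label) {
    if (url == null || !url.startsWith("http://")) {
      messages.add(label + " is not a valid link");
    }
  }
}
```

The production visitor would implement the same interface but fetch each URL and compare the content, and the domain object would not need to change.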

As an interesting aside, in Ruby on Rails, validation is baked into domain objects, including validation that depends on data access. This is because Rails’ ActiveRecord, unsurprisingly, follows the Active Record pattern, in which the object is both a domain object and a DAO.

e.printStackTrace() is not for you

March 21st, 2006

While reading through another team’s Java codebase recently, I came across a disturbing proclivity for code like this

public SomeType aMethod() {
  SomeType result = null;
  try {
    anObject.thatDoesSomething();
    result = anObject.getSomethingElse();
  } catch (SomeTypeOfException e) {
    e.printStackTrace();
    // or, sometimes, log.error(e);
  }
  return result;
}

This is called swallowing the exception. The only way to know that an exception occurred is to have access to the stderr stream of the process (or, if logging was used, the logs). Since the software product referred to above runs inside an application server, clients never get to see this information. The method just returns null. Sometimes it’s obvious that null indicates a problem. Other times, however, the client may wrongly interpret null as absence of information.

This brings me to what I think should be a rule for Java code: almost always, printStackTrace() is not for us to use. Its only utility is for toy programs and logging libraries. When presented with exceptions that your code cannot handle itself, you should rethrow those exceptions up the call stack, so that the server container can report problems to any client and to the application logs.

Sometimes you have to wrap exceptions in another type of exception (e.g. a ServletException) or a custom runtime exception (we usually create a class SystemException for this), because the callbacks your server gives you do not declare that they throw the right exception type. This is one of the reasons I like using Spring MVC; the controller callbacks all declare that they throw Exception (although I’m sure most other modern Java web frameworks share this property these days).
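
As a sketch of the wrapping approach, assuming a custom unchecked SystemException like the one mentioned above (the constructor shown is a guess at a reasonable shape, and FlakyResource and ResourceClient are hypothetical stand-ins):

```java
import java.io.IOException;

// Unchecked wrapper along the lines of the SystemException mentioned above.
class SystemException extends RuntimeException {
  SystemException(String message, Throwable cause) {
    super(message, cause);
  }
}

// Stand-in for a collaborator whose method throws a checked exception.
class FlakyResource {
  String read(boolean fail) throws IOException {
    if (fail) {
      throw new IOException("disk unavailable");
    }
    return "payload";
  }
}

class ResourceClient {
  private final FlakyResource resource = new FlakyResource();

  // No null return and no printStackTrace(): the failure travels up the
  // call stack with the original exception preserved as the cause.
  String fetch(boolean fail) {
    try {
      return resource.read(fail);
    } catch (IOException e) {
      throw new SystemException("could not read resource", e);
    }
  }
}
```

Because the cause is passed along, whatever finally logs the SystemException still gets the original stack trace.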

So, remember (though most of you probably know this already): if you see e.printStackTrace() in your code, it likely means your code has a problem you need to fix. Even throwing new RuntimeException(e) is better than swallowing the exception.

Update: Reading this later, I realized I left out an important point. The above assumes you actually need the try/catch block. If SomeTypeOfException were an unchecked (i.e. runtime) exception, you could just let it bubble up the call stack. If it’s a checked exception, in order to let it bubble up, you have to declare the exception in the method signature. This is inadvisable if the exception makes no sense for the method (e.g. declaring that an Employee’s getPhoneNumber() method throws SQLException). In this case it’s preferable to wrap the exception either in a checked exception that makes sense in the method signature or in an unchecked exception that does not need to be declared.
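
Here is a minimal sketch of the checked variant, using the getPhoneNumber() example. ContactLookupException and PhoneDao are hypothetical names invented for illustration, and the SQLException is simulated rather than coming from a real database.

```java
import java.sql.SQLException;

// Hypothetical checked exception that makes sense in Employee's signature.
class ContactLookupException extends Exception {
  ContactLookupException(String message, Throwable cause) {
    super(message, cause);
  }
}

// Stand-in for a data access object; the failure is simulated.
class PhoneDao {
  String findPhoneNumber(int employeeId) throws SQLException {
    if (employeeId < 0) {
      throw new SQLException("connection lost");
    }
    return "555-0100";
  }
}

class Employee {
  private final int id;
  private final PhoneDao dao;

  Employee(int id, PhoneDao dao) {
    this.id = id;
    this.dao = dao;
  }

  // Callers see a domain-level failure instead of a persistence detail.
  String getPhoneNumber() throws ContactLookupException {
    try {
      return dao.findPhoneNumber(id);
    } catch (SQLException e) {
      throw new ContactLookupException("could not look up phone number", e);
    }
  }
}
```

Swapping ContactLookupException for an unchecked wrapper removes the throws clause but keeps the same try/catch shape.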