At my new job we recently had to use a Rails migration to convert millions of rows of data. Unfortunately the conversion could not be done in SQL; we had to load each row and use Ruby to massage the columns. When we tested the conversion on a replica of the real database and measured how long the migration would take, we realized it would need almost a week to complete. Looking deeper, the problem proved to be the Ruby garbage collector. According to posts I’ve read elsewhere, it is optimized for short-running scripts and tries hard to keep the interpreter’s memory footprint small, which means it runs very often when there are lots of objects in memory. By my own measurements it was running roughly once for every ten table rows converted, i.e. hundreds of thousands of times over the course of our migration.
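For anyone who wants to take the same kind of measurement, here is a minimal sketch of counting collector runs around a batch of row conversions. It assumes a newer Ruby than the 1.8.5 discussed in this post, because it relies on GC.count, which 1.8.5 does not have. The convert_row method and the sample rows are hypothetical stand-ins for the real per-row massaging, not our actual migration code.

```ruby
# Hypothetical stand-in for the per-row conversion work done in the migration.
def convert_row(row)
  { id: row[:id], name: row[:name].strip.split.map(&:capitalize).join(" ") }
end

# Returns how many times the garbage collector ran while the block executed.
# GC.count is assumed to be available (Ruby 1.9 and later, not 1.8.5).
def gc_runs_during
  before = GC.count
  yield
  GC.count - before
end

# Made-up sample data standing in for table rows.
rows = Array.new(10_000) { |i| { id: i, name: "  some  customer name #{i} " } }

converted = []
runs = gc_runs_during do
  rows.each { |row| converted << convert_row(row) }
end

puts "GC ran #{runs} times while converting #{rows.size} rows"
```

Dividing the number of rows by the number of runs gives a rough rows-per-collection figure, which is the number that told me the collector was the bottleneck.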
Stefan Kaes, the author of railsbench and the RailsExpress blog, has a patch for Ruby that gives you more control over the garbage collector through a few environment variables (essentially letting Ruby consume more memory on your machine in exchange for running the GC less often). I didn’t want to apply this patch on my machine (a MacBook Pro) without being able to uninstall it, so I wrote a MacPorts portfile based on the one for Ruby 1.8.5. It installs Ruby 1.8.5-p2 and applies Stefan’s GC patch. Since 1.8.5-p2 already includes the CGI denial-of-service fix that the 1.8.5 portfile applies as a patch, I removed that patch (ruby-1.8.5-cgi-dos-1.patch). I installed the port, and this turned a week of running time into a little less than a day. So, thanks, Stefan. If you’d like to try out the portfile yourself, you can download it here. The gzipped tarfile includes the portfile, the required patch files distributed with the original portfile, and the railsbench patch, renamed to patch-gc.c so that MacPorts can use it properly. If you try it out, please let me know how it works for you.
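As a rough illustration of how those environment variables get used, the sketch below builds a command line that sets them before launching the migration. The variable names are the ones I recall from the railsbench GC patch documentation, so check them against the patch you actually apply. The values are purely illustrative and are not the settings we used, and the rake db:migrate command is an assumption about how the migration is started.

```ruby
# Variable names as recalled from the railsbench GC patch documentation
# (verify against your patch); the values are illustrative, not recommendations.
GC_TUNING = {
  "RUBY_HEAP_MIN_SLOTS"           => "600000",   # initial number of heap slots
  "RUBY_HEAP_SLOTS_INCREMENT"     => "250000",   # slots added when the heap first grows
  "RUBY_HEAP_SLOTS_GROWTH_FACTOR" => "1",        # multiplier applied to later increments
  "RUBY_GC_MALLOC_LIMIT"          => "50000000", # bytes malloc'ed before a GC is forced
  "RUBY_HEAP_FREE_MIN"            => "100000"    # free slots required after a GC run
}.freeze

# Builds a shell command line that sets the variables for a single invocation.
# The default command is an assumption about how the migration is launched.
def migration_command(env = GC_TUNING, command = "rake db:migrate")
  assignments = env.sort.map { |name, value| "#{name}=#{value}" }
  (assignments + [command]).join(" ")
end

puts migration_command
```

The variables have to be in the environment before the interpreter starts, which is why the sketch prints a command line to run rather than assigning to ENV inside an already running Ruby process.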
Note: Just as I was about to publish this, I noticed that about two hours ago MacPorts released an official 1.8.5-p12 portfile (what happened to p3 through p11?). It does not, of course, contain the GC patch, so I’ve updated my Portfile based on the new one. Also, instead of applying the GC patch by default, I’ve included it as a variant called railsbench. So, once you’ve unpacked the tarball into your local port repository, added that repository to /opt/local/etc/ports/sources.conf, and run portindex on the local repository, all you have to do is sudo port install ruby @1.8.5-p12 +railsbench, that last part being the variant. See the INSTALL file in the tarball for more detailed instructions.