<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Inforbiomatica &#187; Informatics</title>
	<atom:link href="http://www.moseshohman.com/blog/category/informatics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.moseshohman.com/blog</link>
	<description>software development, informatics, etc.</description>
	<lastBuildDate>Fri, 06 Aug 2010 07:22:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>R makes NYT</title>
		<link>http://www.moseshohman.com/blog/2009/01/07/r-makes-nyt/</link>
		<comments>http://www.moseshohman.com/blog/2009/01/07/r-makes-nyt/#comments</comments>
		<pubDate>Wed, 07 Jan 2009 07:34:50 +0000</pubDate>
		<dc:creator>Moses</dc:creator>
				<category><![CDATA[Informatics]]></category>

		<guid isPermaLink="false">http://www.moseshohman.com/blog/?p=60</guid>
		<description><![CDATA[Nice: http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html]]></description>
			<content:encoded><![CDATA[<p>Nice: <a href="http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html">http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.moseshohman.com/blog/2009/01/07/r-makes-nyt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Pair programming and microarrays</title>
		<link>http://www.moseshohman.com/blog/2008/07/09/pair-microarrays/</link>
		<comments>http://www.moseshohman.com/blog/2008/07/09/pair-microarrays/#comments</comments>
		<pubDate>Wed, 09 Jul 2008 09:46:13 +0000</pubDate>
		<dc:creator>Moses</dc:creator>
				<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Open R&D]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[microarrays pairing]]></category>

		<guid isPermaLink="false">http://www.moseshohman.com/blog/?p=59</guid>
		<description><![CDATA[Yesterday I met with folks at Lawrence Berkeley labs. The PI entered the room, full of energy and clearly ducking briefly out of the fray to speak with us. Part of the discussion revolved around microarray experiments. We&#8217;ve all heard &#8230; <a href="http://www.moseshohman.com/blog/2008/07/09/pair-microarrays/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Yesterday I met with folks at Lawrence Berkeley labs. The PI entered the room, full of energy and clearly ducking briefly out of the fray to speak with us. Part of the discussion revolved around microarray experiments. We&#8217;ve all heard about how notoriously difficult it is to reproduce microarray experiments. People have proposed minimum information standards (really they&#8217;re guidelines) to combat this problem, and we&#8217;ve also all heard that often these standards aren&#8217;t enough. Even if people are following the guidelines, inevitably a crucial piece of information isn&#8217;t obviously critical and therefore isn&#8217;t communicated.</p>

<p>The PI noted that he has seen it to be helpful when more than one lab conducts an experiment, so that each can help the other avoid finicky and/or tacit experimental conditions that would prevent others from reproducing their results. I have wondered for some time (and for the case of microarrays in particular) whether the practice of &#8220;pair programming&#8221; that we use in software development would be more helpful than minimum information standards to increase the reproducibility of complex experiments. The problem with this, as the PI pointed out, is that duplicating every experiment can get expensive, and in the world of soft money (especially today&#8217;s world), people are always looking for ways to make the research dollar go farther. The possible long term efficiency of duplicating some efforts to increase data value and reduce a tendency to go down blind alleys might not be easy to quantify, and thus not easy to weigh quantitatively against the immediate penalty of &#8220;getting half as much work done&#8221;. (That&#8217;s certainly true in software.)</p>

<p>The PI pointed out that even if direct duplication was too expensive, he still advocated some kind of collaboration on experiments. In particular he advocated getting people together in the same room to look at the experiment together <em>as it was being performed</em>, so that the collaborator might catch important things that weren&#8217;t immediately apparent to the person performing the experiment. This, at most, only costs a small amount of travel funds.</p>

<p>I asked the PI if others shared his views, and he said that most of the larger microarray efforts had some sort of distributed work going on, but he wasn&#8217;t sure that this idea had been formalized anywhere.</p>

<p>I&#8217;m interested in this not only because of its parallel with software work, but also because I work for a <a target="_blank" href="http://www.collaborativedrug.com">company focused on facilitating collaborative science</a>. I&#8217;m very interested in the different forms that scientific collaboration can take, and how best to help them along.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.moseshohman.com/blog/2008/07/09/pair-microarrays/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Z factor refactored</title>
		<link>http://www.moseshohman.com/blog/2007/11/11/z-factor-refactored/</link>
		<comments>http://www.moseshohman.com/blog/2007/11/11/z-factor-refactored/#comments</comments>
		<pubDate>Sun, 11 Nov 2007 23:52:48 +0000</pubDate>
		<dc:creator>Moses</dc:creator>
				<category><![CDATA[Informatics]]></category>

		<guid isPermaLink="false">http://www.moseshohman.com/blog/2007/11/11/51</guid>
		<description><![CDATA[I recently reread the original Z factor paper (Zhang et al). The Z factor is a measure of assay reliability and comes in two flavors: the Z&#8217; factor, based entirely on controls (those with and without the desired effect); &#8230; <a href="http://www.moseshohman.com/blog/2007/11/11/z-factor-refactored/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I recently reread the original Z factor paper (<a href="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&amp;Cmd=ShowDetailView&amp;TermToSearch=10838414">Zhang et al</a>). The Z factor is a measure of assay reliability and comes in two flavors: the Z&#8217; factor, based entirely on controls (those with and without the desired effect); and the Z factor, based on experimental data compared with the controls that should have the desired effect.</p>

<p>Rereading a paper months later often makes you wonder whether you read the paper at all the first time. This reading really clarified for me what the Z factor is, that it is not just for high-throughput screening, and raised a number of questions (especially after discussion with colleagues) not addressed in the paper.</p>

<p>The Z factor is the ratio of the &#8220;separation band&#8221; of the data to the assay dynamic range. A picture helps:</p>

<p><img src="/blog/wp-content/uploads/2007/11/z-factor.png" alt="separation band image" title="Z factor: the separation band" /></p>

<p>where &mu;<sub>+</sub> is the mean of the positive controls (in this case the controls with desired effect), &mu;<sub>s</sub> is the mean of the data, &sigma;<sub>+</sub> is the standard deviation of the positive controls, etc. The assay dynamic range in this diagram is &mu;<sub>+</sub> &#8211; &mu;<sub>s</sub>. The screening window is then (&mu;<sub>+</sub> &#8211; &mu;<sub>s</sub>) &#8211; (3&sigma;<sub>+</sub> + 3&sigma;<sub>s</sub>), and the ratio of this to the dynamic range is the Z factor = 1 &#8211; (3&sigma;<sub>+</sub> + 3&sigma;<sub>s</sub>)/(&mu;<sub>+</sub> &#8211; &mu;<sub>s</sub>).</p>
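
<p>As a minimal sketch of the formula above, here is how the calculation might look in Python when starting from raw measurements. The function name <code>z_factor</code> and the readings are my own illustrative assumptions, not taken from Zhang et al, and I take the absolute value of the difference of the means so that the order of the two groups does not matter.</p>

```python
from statistics import mean, stdev

def z_factor(positive_controls, samples):
    """Estimate Z = 1 - (3*sd_pos + 3*sd_s) / |mean_pos - mean_s| from data."""
    dynamic_range = abs(mean(positive_controls) - mean(samples))
    if dynamic_range == 0:
        # The means coincide, so the distributions overlap and Z diverges.
        return float("-inf")
    spread = 3 * stdev(positive_controls) + 3 * stdev(samples)
    return 1 - spread / dynamic_range

# Made-up readings, purely for illustration.
positives = [98.0, 101.0, 100.0, 102.0, 99.0]
samples = [10.0, 12.0, 9.0, 11.0, 8.0]
print(round(z_factor(positives, samples), 3))
```

<p>Note that the means and standard deviations here are sample estimates, so the result is itself only an estimate of Z.</p>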

<p>(If you&#8217;re reading this in an RSS reader, the story continues on my website.)</p>

<p><span id="more-51"></span></p>

<p><a href="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&amp;Cmd=ShowDetailView&amp;TermToSearch=10838414">Zhang et al</a> go on to describe desirable values of Z: Z = 1 is an &#8220;ideal assay&#8221; (the standard deviations are negligible compared to the difference between the means), 1 > Z ≥ 0.5 is an &#8220;excellent assay&#8221;, 0.5 > Z > 0 is a &#8220;double assay&#8221;, Z = 0 is a &#8220;yes/no assay&#8221; (no separation band, the two 3&sigma; regions touch), and for Z &lt; 0 &#8220;screening is essentially impossible&#8221;. Note that when the two distributions completely overlap (&mu;<sub>s</sub> = &mu;<sub>+</sub>), then Z is -&infin;.</p>
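
<p>The categories above can be written down as a small lookup. This is only a sketch: the function name <code>describe_assay</code> is hypothetical, and in practice an estimated Z will rarely be exactly 0 or 1, so the two boundary cases are mostly of conceptual interest.</p>

```python
def describe_assay(z):
    """Map a Z factor onto the categories quoted above from Zhang et al."""
    if z > 1:
        raise ValueError("Z cannot exceed 1")
    if z == 1:
        return "ideal assay"
    if z >= 0.5:
        return "excellent assay"
    if z > 0:
        return "double assay"
    if z == 0:
        return "yes/no assay"
    return "screening essentially impossible"

for z in (1.0, 0.7, 0.2, 0.0, -1.5):
    print(z, describe_assay(z))
```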

<p>As I mentioned above, the Z factor&#8217;s usefulness is not restricted to high-throughput screening assays. Indeed, it can be applied to any assay that measures a number of experimental subjects in an identical way and measures control values. However, when discussing application to assay optimization, the paper does point out that use of the Z factor requires &#8220;relatively large data sets&#8221;.</p>

<p>Three major questions arise:</p>

<ol>
<li>What do these significant values of Z mean, especially 0.5? What is a &#8220;double&#8221; or &#8220;yes/no&#8221; assay?</li>
<li>How large is a &#8220;relatively large data set&#8221;, i.e. how large does your data set need to be to use the Z factor?</li>
<li>Why is the range of Z -&infin; to 1? Is there another parameter with a more intuitive range?</li>
</ol>

<p>A nice companion reading for deeper understanding is the <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&amp;Cmd=ShowDetailView&amp;TermToSearch=17218666">paper</a> published in March 2007 by <a href="http://www.stat.brown.edu/~ysui/">Sui</a> and <a href="http://www.stat.brown.edu/~zwu/">Wu</a>. They take on question #1 by examining the statistical power of an assay at different Z factor values. The statistical power, in the case of a drug screening assay, is the probability that an active compound is scored as a hit (i.e. the probability of &#8220;true positives&#8221;). The Z factor is calculated without referring to the hit threshold, the value beyond which a compound is scored as a hit. Often people score all outliers three standard deviations outside the mean as &#8220;hits&#8221; (in our diagram above, this would be any measurements falling above &mu;<sub>s</sub> + 3&sigma;<sub>s</sub>, i.e. within the separation band or above). Sui and Wu show that if the standard deviations of sample data and controls are equal, then a Z factor of 0.5 corresponds to a statistical power of 0.999, i.e. there is only a 0.1% chance that an active compound is not scored as a hit.</p>

<p>However, they also show that if the standard deviations are not equal, then interpreting the Z factor becomes considerably trickier. They also show that although the Z factor calculation does not necessarily rely on the error distributions being normal, for non-normal error distributions (where the sample (or the control data) is not well described by the normal distribution N(&mu;<sub>s</sub>, &sigma;<sub>s</sub><sup>2</sup>)) the Z factor does a poor job of describing the reliability of the assay, demonstrated by the fact that the Z factor is different if non-normally distributed data is transformed to be closer to normal.</p>

<p>Sui and Wu suggest caution when interpreting Z (and Z&#8217;) factor values, and recommend that analysts confirm normality of the data (transforming if necessary) and calculate the statistical power corresponding to the distributions of the sample/control data and the hit threshold to get a more reliable measure of assay reliability.</p>

<p>Humorous side note: Sui and Wu interpret &#8220;double assay&#8221; to mean &#8220;doable assay&#8221;. Who knows.</p>

<p>As for question #2 (&#8220;how large does a data set need to be to get a reliable estimate of Z factor?&#8221;), this can be calculated from the standard error of the estimators used to calculate the means and standard deviations. That is, when calculating Z factor, one can&#8217;t actually use the real means and standard deviations of the underlying distributions of the samples and controls, one can only estimate these quantities by making a number of measurements of the samples and controls. I plan to calculate these and publish them here at some point, but any statistician can do the same (probably more efficiently and correctly than I).</p>
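
<p>Until someone works out those standard errors analytically, a Monte Carlo simulation gives a rough feel for the answer. The sketch below is not the analytic calculation described above: it simply draws repeated simulated assays from two normal distributions whose parameters (means of 100 and 40, standard deviation of 5, so a true Z of 0.5) I made up for illustration, and reports how much the estimated Z scatters for a given number of measurements per group.</p>

```python
import random
from statistics import mean, stdev

def z_factor(positive_controls, samples):
    """Estimate Z from measured positive controls and samples."""
    dynamic_range = abs(mean(positive_controls) - mean(samples))
    spread = 3 * stdev(positive_controls) + 3 * stdev(samples)
    return 1 - spread / dynamic_range

def z_estimate_spread(n, trials=1000, seed=0):
    """Standard deviation of the estimated Z over repeated simulated assays
    with n positive controls and n samples each (assumed normal)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        positives = [rng.gauss(100.0, 5.0) for _ in range(n)]
        samples = [rng.gauss(40.0, 5.0) for _ in range(n)]
        estimates.append(z_factor(positives, samples))
    return stdev(estimates)

for n in (5, 20, 80, 320):
    print(n, round(z_estimate_spread(n), 3))
```

<p>The scatter shrinks as n grows, which is one way to decide how many wells are enough for the precision you need. Real assay data may not be normal, in which case the caveats from Sui and Wu apply.</p>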

<p>Finally, question #3: <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&amp;Cmd=ShowDetailView&amp;TermToSearch=10838414">Zhang et al</a>&#8217;s objective was to develop a dimensionless constant that took three distributional parameters into account: the difference of the means of the samples and controls, the variability of the controls and the variability of the sample data. There are other ways to combine these parameters into dimensionless constants that are different from the Z factor. For instance, one could calculate the ratio of the separation band to the sum 3&sigma;<sub>+</sub> + 3&sigma;<sub>s</sub>; call this parameter C. C varies between -1 and +&infin;, and if &sigma; is the same for controls and samples, then C = -1 when Z = -&infin;, C = 0 when Z = 0, C = 1 when Z = 0.5, and C = +&infin; when Z = 1.</p>
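
<p>The correspondence between C and Z can be checked with a few lines of Python. Writing r for the ratio of 3&sigma;<sub>+</sub> + 3&sigma;<sub>s</sub> to the dynamic range, the definitions give Z = 1 - r and C = (1 - r)/r, so C = Z/(1 - Z). The function name <code>c_from_z</code> is just an illustrative choice.</p>

```python
def c_from_z(z):
    """Convert a Z factor into the alternative parameter C.

    With r = (3*sd_pos + 3*sd_s) / dynamic range, Z = 1 - r and
    C = (1 - r) / r, which simplifies to C = Z / (1 - Z).
    """
    if z == 1:
        return float("inf")
    return z / (1 - z)

# A very negative Z stands in for the limit Z going to -infinity.
for z in (-1e9, 0.0, 0.5, 0.9, 1.0):
    print(z, c_from_z(z))
```

<p>As Z becomes very negative, C approaches -1, matching the limiting case of completely overlapping distributions.</p>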

<p>This gets rid of the weird &#8220;0.5&#8221; and the upper limit of 1, but I think that actually the Z factor is a more accurate reflection of reliability. The reason is that the difference in reliability between an assay with Z of 0.9 and one with Z of 1.0 is really quite small, even if the dynamic range in these two cases is very different, because 0.9 is already good enough. By contrast, the difference between the reliability of an assay with Z = 0 and Z = -&infin; is huge, because we go from having the data variability ranges ([&mu; - 3&sigma;, &mu; + 3&sigma;] with appropriate subscripts) touch to having the two distributions completely overlap. C, however, would only vary from 0 to -1 between these two cases.</p>

<p>I may retouch the explanation above for clarity at some point, especially if people ask questions.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.moseshohman.com/blog/2007/11/11/z-factor-refactored/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>A better LSID, part 1</title>
		<link>http://www.moseshohman.com/blog/2007/04/30/a-better-lsid-part-1/</link>
		<comments>http://www.moseshohman.com/blog/2007/04/30/a-better-lsid-part-1/#comments</comments>
		<pubDate>Mon, 30 Apr 2007 06:55:23 +0000</pubDate>
		<dc:creator>Moses</dc:creator>
				<category><![CDATA[Informatics]]></category>

		<guid isPermaLink="false">http://www.moseshohman.com/blog/2007/04/30/39</guid>
		<description><![CDATA[Those of you familiar with bioinformatics might know about the Life Sciences Identifier (LSID) specification, which describes a URN-based identifier (think &#8220;primary key&#8221;) for life sciences data objects (genes, proteins, microarray experiments, radiology images, clinical trial study calendars, etc.). I &#8230; <a href="http://www.moseshohman.com/blog/2007/04/30/a-better-lsid-part-1/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Those of you familiar with bioinformatics might know about the <a href="http://lsid.sourceforge.net/">Life Sciences Identifier</a> (LSID) specification, which describes a URN-based identifier (think &#8220;primary key&#8221;) for life sciences data objects (genes, proteins, microarray experiments, radiology images, clinical trial study calendars, etc.). I learned a lot about these over my last year and a half at Northwestern during my work with <a title="Cancer Biomedical Informatics Grid">caBIG</a>, because I was heavily involved with developing the first draft proposal for the use of data object identifiers within the caBIG grid. The intended benefits of LSIDs primarily include the following:</p>

<ol>
  <li><strong>location independence</strong> &#8212; an LSID identifies a resource, not a particular data record on a particular server on the network</li>
  <li><strong>global uniqueness</strong> &#8212; the same LSID should never be used to refer to two distinct objects</li>
  <li><strong>local assignment</strong> &#8212; there must be some easy mechanism for data object creators to assign globally unique LSIDs to their data without the need for onerous bureaucracy</li>
  <li><strong>permanence</strong> &#8212; once assigned, an LSID cannot be reassigned to a different object</li>
  <li><strong>semantic opacity</strong> &#8212; data clients who use LSIDs to refer to data are not supposed to read into the substructure of the LSID and make any conclusions about what the LSID means, other than being a key for a particular data object, e.g. urn:lsid:frank:dog:38922 should not be assumed to have anything to do with Frank or his dog.</li>
  <li><strong>data and metadata</strong> &#8212; in addition to the bytes representing the data object identified, there is a separate &#8220;channel&#8221; of information, the metadata, which can contain information about relationships between the given data object and other data objects.</li>
</ol>

<p>and&#8230;, some other stuff, see the <a href="http://www.omg.org/cgi-bin/doc?dtc/04-10-08">official spec</a>. LSIDs look like this: <code>urn:lsid:ncbi.nlm.nih.gov:pubmed:9486653</code>. One of the key steps in employing LSIDs on a grid (or in any network accessible way) is the deployment of some kind of resolution service, a server or set of servers that delivers the data for a given LSID to a requesting data client. Also, as a corollary of the &#8220;global uniqueness&#8221; requirement above, it is important that requesting resolution of a given LSID should always return the same set of bytes, regardless of when the resolution occurs. That is, if you assign an LSID to an object, you assign it to a specific set of bytes representing that object, and that association can never change (although the data provider might stop providing the data). This allows you to cache these objects without worrying about cache invalidation (since the bytes will never change) and more importantly enables reproducible research: a computational researcher can publish results based on a dataset identified by a bunch of LSIDs, and another researcher later can run the same computations on those same LSIDs and expect to arrive at exactly the same results.</p>
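
<p>To make the caching point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: <code>LsidCache</code> and <code>fake_resolver</code> are names I made up, and the resolver is a stand-in rather than a real LSID resolution client. The only idea it illustrates is that, because the bytes behind an LSID never change, a cache entry never has to be invalidated.</p>

```python
class LsidCache:
    """Cache resolved bytes forever: an LSID never maps to different bytes."""

    def __init__(self, resolve):
        self._resolve = resolve  # hypothetical callable: LSID string -> bytes
        self._store = {}

    def get(self, lsid):
        # Resolve only on the first request; later requests reuse the bytes.
        if lsid not in self._store:
            self._store[lsid] = self._resolve(lsid)
        return self._store[lsid]

calls = []

def fake_resolver(lsid):
    """Stand-in for a real resolution service; records each lookup."""
    calls.append(lsid)
    return ("data for " + lsid).encode("utf-8")

cache = LsidCache(fake_resolver)
first = cache.get("urn:lsid:ncbi.nlm.nih.gov:pubmed:9486653")
second = cache.get("urn:lsid:ncbi.nlm.nih.gov:pubmed:9486653")
print(first == second, len(calls))
```

<p>Note that the cache treats the LSID purely as an opaque key, in keeping with the semantic opacity requirement above.</p>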

<p>Importantly, the metadata referred to by an LSID is not required to be byte-identical, so it has a bit more flexibility to cover some important use cases, as we will see in my next post on the subject.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.moseshohman.com/blog/2007/04/30/a-better-lsid-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Agile bioinformatics paper</title>
		<link>http://www.moseshohman.com/blog/2006/06/18/agile-bioinformatics-paper/</link>
		<comments>http://www.moseshohman.com/blog/2006/06/18/agile-bioinformatics-paper/#comments</comments>
		<pubDate>Mon, 19 Jun 2006 03:22:01 +0000</pubDate>
		<dc:creator>Moses</dc:creator>
				<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.moseshohman.com/blog/2006/06/18/35</guid>
		<description><![CDATA[I don&#8217;t know why I didn&#8217;t post this before, but at the end of last month BMC Bioinformatics posted a provisional version of our paper on agile software methods in bioinformatics. The good news is people seem to be reading &#8230; <a href="http://www.moseshohman.com/blog/2006/06/18/agile-bioinformatics-paper/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I don&#8217;t know why I didn&#8217;t post this before, but at the end of last month BMC Bioinformatics posted a provisional version of <a href="http://www.biomedcentral.com/1471-2105/7/273/abstract">our paper on agile software methods in bioinformatics</a>. The good news is people seem to be reading it. It is the journal&#8217;s <a href="http://www.biomedcentral.com/bmcbioinformatics/mostviewed/">#3 most viewed</a> paper in the last thirty days! Give it a read and let me know what you think.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.moseshohman.com/blog/2006/06/18/agile-bioinformatics-paper/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SciLnk</title>
		<link>http://www.moseshohman.com/blog/2006/04/20/scilnk/</link>
		<comments>http://www.moseshohman.com/blog/2006/04/20/scilnk/#comments</comments>
		<pubDate>Fri, 21 Apr 2006 04:51:40 +0000</pubDate>
		<dc:creator>Moses</dc:creator>
				<category><![CDATA[Informatics]]></category>

		<guid isPermaLink="false">http://www.moseshohman.com/blog/2006/04/20/32</guid>
		<description><![CDATA[Some friends of mine have been working on an interesting new startup called SciLnk. Essentially, the site is LinkedIn for life scientists. It will also allow you to browse pubmed abstracts via the network of authors. Often when you&#8217;re researching &#8230; <a href="http://www.moseshohman.com/blog/2006/04/20/scilnk/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Some friends of mine have been working on an interesting new startup called <a href="http://www.scilnk.com/">SciLnk</a>. Essentially, the site is <a href="http://www.linkedin.com/">LinkedIn</a> for life scientists. It will also allow you to browse pubmed abstracts via the network of authors. Often when you&#8217;re researching a subject you want to read all the papers published by a particular person and his or her collaborators on the topic, and it&#8217;s not as easy as it should be to collect everything via PubMed. SciLnk hopes to leverage the people network to improve the literature browsing experience and vice versa. There are lots of other interesting directions to take such a resource: improving the conference-going experience, grant searching/suggestions, job searching, etc.</p>

<p>They recently posted <a href="http://blog.scilnk.com/articles/2006/04/17/scilnk-screenshots">screenshots</a> (and, more recently, <a href="http://blog.scilnk.com/articles/2006/04/19/scilnk-internal-development">these</a>) of the product under development, and you can sign up to get in on internal beta testing at <a href="http://www.scilnk.com/">scilnk.com</a>.</p>

<p>Full disclosure: I was personally working with the SciLnk team in its early days, but quickly realized I had bitten off more than I could chew, what with having a full-time day job and not living in Boston with the rest of the crew. I no longer have any financial interest in the company, I just think it&#8217;s a cool idea.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.moseshohman.com/blog/2006/04/20/scilnk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A few more thoughts on communication, tech posts to come</title>
		<link>http://www.moseshohman.com/blog/2006/03/21/a-few-more-thoughts-on-communication-tech-posts-to-come/</link>
		<comments>http://www.moseshohman.com/blog/2006/03/21/a-few-more-thoughts-on-communication-tech-posts-to-come/#comments</comments>
		<pubDate>Tue, 21 Mar 2006 23:17:07 +0000</pubDate>
		<dc:creator>Moses</dc:creator>
				<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.moseshohman.com/blog/2006/03/21/29</guid>
		<description><![CDATA[I discovered recently on postgenomic.com that mine is one of the wordiest life science blogs around, so I&#8217;m going to try to be a little pithier. We&#8217;ll see if I can constrain myself. In my last post I argued for &#8230; <a href="http://www.moseshohman.com/blog/2006/03/21/a-few-more-thoughts-on-communication-tech-posts-to-come/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I discovered recently on <a href="http://postgenomic.com">postgenomic.com</a> that mine is one of the wordiest life science blogs around, so I&#8217;m going to try to be a little pithier. We&#8217;ll see if I can constrain myself.</p>

<p>In my last post I argued for the central importance of effective collaboration and communication in biomedical informatics. I wanted to list a few things that have worked for my teams in those areas. At Northwestern we worked on two projects. <a href="http://www.neuromice.org">Neuromice.org</a> is a phenotype database and virtual storefront for the mutant lines produced by three neurologically-focused whole genome mutagenesis efforts at Northwestern, the <a href="http://www.jax.org">Jackson Laboratory</a> and the <a href="http://www.tnmouse.org/">Tennessee Mouse Genome Consortium</a>. The other application, MouseDB, is an intranet (i.e. you can&#8217;t see it) colony and phenotyping management system for the mice under study at Northwestern (10,000 mice/year when we were in full swing). Each project had different challenges, but here are a few things I learned from those experiences. Some are pretty standard agile ideas, others less so.</p>

<ul>
    <li>Each distinct customer/user subgroup should appoint a representative who speaks for that subgroup in all discussions of feature definition and priority. Keep the number of subgroups as small as possible (ideally, one). This greatly reduces the uncertainty and difficulty of scope decisions.</li>
    <li>Some users in the group might have no reason to use your software. Make this fact explicit, and don&#8217;t factor their interests into the product.</li>
    <li>Be completely open with your user community. Give them the opportunity to know everything you&#8217;re working on, and the reasons for (and the opportunity to contribute to) any decisions made about features going into the product.</li>
    <li>A development team should avoid making any decisions about scope or feature priority. Emphasize to users that it is in their power to steer the software toward the greatest possible utility. Technical improvements are a sticking point here, but we&#8217;ve found if you make a good argument for them, users understand their value and will prioritize them appropriately.</li>
    <li>If you let academics&#8217; busy schedules eat away at your face time with them, you will eventually suffer for it. Be creative.</li>
</ul>

<p>I think I&#8217;ve reached my word limit. Over the next couple weeks I&#8217;m going to let loose a flurry of technical posts on various topics that have been on my mind lately.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.moseshohman.com/blog/2006/03/21/a-few-more-thoughts-on-communication-tech-posts-to-come/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The promise of bioinformatics</title>
		<link>http://www.moseshohman.com/blog/2006/03/03/promise-of-bioinformatics/</link>
		<comments>http://www.moseshohman.com/blog/2006/03/03/promise-of-bioinformatics/#comments</comments>
		<pubDate>Sat, 04 Mar 2006 00:14:45 +0000</pubDate>
		<dc:creator>Moses</dc:creator>
				<category><![CDATA[Informatics]]></category>

		<guid isPermaLink="false">http://www.moseshohman.com/blog/2006/03/03/17</guid>
		<description><![CDATA[Now and again, you hear the concern that bioinformatics will fail to &#8220;fulfill its promise&#8221;. I find this statement to be both a bit scary and a little preposterous. Scary because the success of the field will have an effect &#8230; <a href="http://www.moseshohman.com/blog/2006/03/03/promise-of-bioinformatics/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Now and again, you hear the concern that bioinformatics will fail to &#8220;fulfill its promise&#8221;. I find this statement to be both a bit scary and a little preposterous. Scary because the success of the field will have an effect on my own personal success. Preposterous because, well, the advantages of high throughput computation, structured biological databases, etc. are so abundantly clear, how could bioinformatics possibly fail?</p>

<p>There are certainly success stories. Important approaches to biological analysis in use today were not available ten years ago. I think some of the frustration arises because, in spite of these successes, some users feel that much of the biomedical software being churned out today just isn&#8217;t quite useful enough to justify the cost (in money and time) of using it. Assuming this is the case, we must then ask, why? Is biological analysis too difficult to capture in a set of machine instructions? Are bioinformaticians just a bunch of good for naughts?</p>

<p>The latter response, though intended to be humorous, is actually probably more common among biologists than we bioinformaticians might like. This answer also suggests a dichotomy too often encountered in organizations undertaking software development, that of users vs. developers, or, in the domain-specific vernacular, scientists/clinicians vs. bioinformaticians. Users, angered by the failure of software, blame developers for not working hard enough, for not listening, for being idiots (you know they think this sometimes), etc. Developers, for their part, are no more charitable. Developers blame users for not knowing what they want, for using software inconsistently, for not being able to work around seemingly trivial problems, and of course for being idiots. Much of the naturally occurring tension churned up in the process of building software finds its release in similar fits of whinging by one camp or the other. When we&#8217;re more reasonable, we are still honestly perplexed by the question, why isn&#8217;t this working out better?</p>

<p>In the past I&#8217;ve heard people say that bioinformaticians just need to be trained very well in both biology and computer science, that this would alleviate a lot of the problem of getting them to build biologically relevant and valuable software. This may work in some cases. A couple weeks ago I was having lunch with a biologist colleague, and he told me that I needed to learn the biology better, otherwise I would always be beholden to biologists to come up with interesting problems to work on. I see what he was getting at, but I don&#8217;t think that is the solution. The truth is both biology and software development are so complex that I don&#8217;t think it&#8217;s possible to gather into one person&#8217;s head all the expertise necessary to produce all the products that bioinformatics promises. Rather, I think the answer is better communication between biology experts and software experts.</p>

<p>Rather than focusing solely on algorithms and technologies, we must focus more on the people side of building biomedical software. You read this very comment in the bio-IT business literature sometimes, taken from the mouths of venture capitalists, in the form of something like &#8220;companies can no longer expect to get funding simply for having cool technology&#8221;; their software has to solve a biologically relevant problem, i.e. it has to be useful. I am reminded of something I heard during a talk at an agile conference in New Orleans in 2003. Josh Kerievsky said, and I paraphrase, &#8220;Some think we&#8217;re in the technology business, but we&#8217;re not. We&#8217;re in the communication business&#8221;. Communicating effectively with users is surprisingly difficult to do, and requires wisdom and dedication to get right. Effective communication is a much bigger challenge than the algorithms and the technology usually are. What&#8217;s more, it&#8217;s a two-way street, and both developers and scientists have to be committed.</p>

<p>I think biologists and bioinformaticians want to communicate better. I think part of the problem is organizational. For instance, at Northwestern, like at any university, it&#8217;s very difficult to get good office space. We were stuck in a converted greenhouse for the three years I was in Evanston, on the top floor of the Hogan building. At first we were in the same building as many of the biologists with whom we worked, but not all of them. Over the course of the project, a nice new building opened up, and a number of them moved into the new building. We typically saw these people once a week, if that, at weekly user meetings. We tried to keep contact with them by going to see them individually on a periodic basis (although I think we probably could have done a better job at that). But these people are very busy, and it&#8217;s difficult to fit into their schedules. As my biologist colleague pointed out at lunch, it would have been ideal if we could have had shared office space, so that spontaneous discussions would have been more frequent. I think the software we built would have become more useful as a result.</p>

<p>There are many other organizational problems (the difficulty of funding ongoing bioinformatics groups on grants; the lack of a history of operational management positions within academic groups). I imagine some of these get easier in industry. But these are not the only problems. I also think that we don&#8217;t yet focus enough attention on communication, on doing it right, and on getting help from outside to do it right. Most of us assume that, hey, we&#8217;re smart people, we should be able to communicate. Part of the problem is we&#8217;re too smart. We&#8217;re trying to communicate complex information, information we&#8217;re used to communicating with peers within our field who usually understand us even if we&#8217;re not clear. The level of tacit knowledge in biology, like in software, is very high, and it often takes people outside the field quite a while to get a feel for a problem.</p>

<p>This post is long enough. We talk about some of these communication issues in the paper that David Kane, others, and I are trying to get published on agile software methods in bioinformatics. I think agile principles help, at the very least because they get software engineers to focus less on impressing people with their bulletproof processes and more on people and on communication that works (i.e. <a href="http://www.agilemanifesto.org">&#8220;Individuals and interactions over processes and tools&#8221;</a>). Although our application domain is more technical, the general software industry and its customers have been grappling with these issues for years, and there are a number of very smart and capable people out there who could really help transfect bioinformatics groups with good approaches for tackling these problems.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.moseshohman.com/blog/2006/03/03/promise-of-bioinformatics/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Northwestern to host biomedical software development meeting</title>
		<link>http://www.moseshohman.com/blog/2006/01/07/northwestern-to-host-biomedical-software-development-meeting/</link>
		<comments>http://www.moseshohman.com/blog/2006/01/07/northwestern-to-host-biomedical-software-development-meeting/#comments</comments>
		<pubDate>Sat, 07 Jan 2006 19:35:19 +0000</pubDate>
		<dc:creator>Moses</dc:creator>
				<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.moseshohman.com/blog/?p=22</guid>
		<description><![CDATA[Good news! Northwestern University will host the second BRIITE meeting in 2006, to take place sometime around August or September of this year. We are planning on offering a second component to the meeting exclusively focused on software development issues, &#8230; <a href="http://www.moseshohman.com/blog/2006/01/07/northwestern-to-host-biomedical-software-development-meeting/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Good news! Northwestern University will host the second <a href="http://www.briite.org/"><acronym title="Biomedical Research Institution Information Technology Exchange">BRIITE</acronym></a> meeting in 2006, to take place sometime around August or September of this year. We are planning on offering a second component to the meeting exclusively focused on software development issues, and for that we will solicit a wider audience than the one that typically attends BRIITE. Instead of just managerial folks, we also want the software developers, testers, scientists, etc., because this will make for a much more lively, interesting and meaningful discussion.</p>

<p>This idea is still very new, but I will keep you posted on developments as things unfold.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.moseshohman.com/blog/2006/01/07/northwestern-to-host-biomedical-software-development-meeting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BRIITE 2005 La Jolla</title>
		<link>http://www.moseshohman.com/blog/2005/12/09/briite-2005-la-jolla/</link>
		<comments>http://www.moseshohman.com/blog/2005/12/09/briite-2005-la-jolla/#comments</comments>
		<pubDate>Fri, 09 Dec 2005 16:37:21 +0000</pubDate>
		<dc:creator>Moses</dc:creator>
				<category><![CDATA[Informatics]]></category>

		<guid isPermaLink="false">http://www.moseshohman.com/blog/?p=21</guid>
		<description><![CDATA[Last month I attended the BRIITE meeting for the first time. As its website says, the meeting&#8217;s mission is to: Establish personal contacts, bringing together those responsible for research computing activities at biomedical research institutions Identify and document common problems &#8230; <a href="http://www.moseshohman.com/blog/2005/12/09/briite-2005-la-jolla/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Last month I attended the <a href="http://www.briite.org/"><acronym title="Biomedical Research Institution Information Technology Exchange">BRIITE</acronym></a> meeting for the first time. As its website says, the meeting&#8217;s mission is to:</p>

<blockquote>
<ul>
<li> Establish personal contacts, bringing together those responsible for research computing activities at biomedical research institutions</li>
<li>Identify and document common problems and interests</li>
<li>Seek opportunities for partnership / consortium activities</li>
<li>Identify common issues that should be brought to attention of home institutions, government and other funding agencies</li>
</ul>
</blockquote>

<p>One of the things I really liked about <acronym title="Biomedical Research Institution Information Technology Exchange">BRIITE</acronym> was its focus on stimulating offline discussions. I went to the meeting hoping to find out if there were others in the biomedical informatics community who cared about software development issues. The focus of the meeting was &#8220;IT Support for Multi-Institution Collaborative Research&#8221;. We heard talks on federated identity management, <a href="http://www.globus.org/">Globus</a>, <a href="http://gridshib.globus.org/">GridShib</a>, <a href="http://www.nbirn.net/"><acronym title="Biomedical Informatics Research Network">BIRN</acronym></a>, and, slightly off-topic but still pretty neat, the <a href="http://www.researchchannel.org/">Research Channel</a>. So, the meeting was mostly about distributed security, and many attending were not primarily software development people.</p>

<p>Due to the focus on offline discussions, attendees are encouraged to suggest topics of interest, and then others sign up to discuss these topics as a group (a conference practice I&#8217;ve heard called a &#8220;Birds of a Feather&#8221; session). I proposed software development practices as a topic, and a small group of us got together for a lively two-hour discussion. I will summarize what we talked about here.</p>

<p>People were interested in the topic for a variety of reasons. One person wanted to know how to introduce better development practices to his group. Another wanted input on how to manage many small projects at once (a common problem in biomedical informatics). Another was plagued with the problem of funding shared informatics resources at a biomedical research institution (people clearly need these resources, but no one wants to or can pay for them). Another wanted to learn more about <a href="http://www.agilealliance.org/">agile</a> software methods. Finally, I wanted to talk about how better to promote awareness and discussion of software practices in the biomedical informatics community.</p>

<p>In the interest of brevity I&#8217;ll just list bullet points rather than go on and on about each item.</p>

<h3>Funding IT</h3>

<p>While not strictly a software development practice topic, anyone who wants to assemble a decent-sized software team at a biomedical research institution runs up against the problem of funding it. Most of the helpful tips here came from Charles Donnelly of the <a href="http://www.jax.org/">Jackson Laboratory</a>.</p>

<ul>
  <li>If these resources don&#8217;t exist, who provides seed funding to get them started? Investigators typically cannot fund these things from their research grants. Institutional funding and commitment is necessary. Where it exists, people have seen some success in developing shared informatics resources (e.g. the Jackson Laboratory).</li>
  <li>Funding improves if core informatics staff reviews research grants to help make them more realistic about informatics needs and funding. Encouraging awareness among scientific investigators and administrators is key.</li>
  <li>Once you have institutional funding, keep track of the time you spend on various grants. Calculate the percentage of your total work this amounts to, and tell the PIs on those grants. This will help them understand how much their informatics costs, even if they aren&#8217;t paying for it yet.</li>
  <li>Generally, PIs that have done some computing in the past are more forward-looking, so these are good people to start with (no-brainer).</li>
  <li>Funding informatics from research grants is difficult, because all budgeting is done ahead of time. Often, however, the informatics needs change over the course of the scientific project. How do you come up with specifics like $ and number of FTEs before a project begins? (The answer, it seems to me, is you do what scientists do. You make your best guess, and then you adjust what you produce based on what you have the money to produce once you learn what is possible during the course of the project.)</li>
</ul>

<h3>Project Management</h3>

<p>Mix of suggestions and problems here . . .</p>

<ul>
  <li>One way to scale your capabilities effectively is to develop shared resources used by several groups. Supporting shared resources can be a challenge, however, because PIs may want results formatted their own way, not in a common format.</li>
  <li>Before embarking on a project, draw up a project charter. This document establishes the level of involvement of key people (on both the informatics and science side) and the high-level outline of the work planned. The process of preparing this document will let you know if science people are invested enough in the project, or if there are reasons to worry. One cannot overestimate the importance of buy-in and personal investment in the success of an informatics endeavor by the science leaders on the project.</li>
  <li>Promote awareness of software development: Jax has a software lifecycle process approved by the institution and posted up everywhere.</li>
</ul>

<h3>Requirements/Communication</h3>

<ul>
  <li>Some used a detailed requirements gathering process (at Jax they draw up 150-page documents) to firm up what would be developed.</li>
  <li>Others used <a href="http://www.agilealliance.org/">agile</a> approaches, which are interactive with the scientists throughout the life of the project and make it a specific point to allow the requirements to evolve along with the scientists&#8217; understanding of their needs.</li>
  <li>All agreed that requirements gathering in biomedical software has to be an ongoing, interactive endeavor &#8211; biologists sometimes want to wait until features are in production before they give feedback; they want to see something working and tinker with it.</li>
  <li>Underestimated opportunity for software people: helping biologists with their ad-hoc solutions (Access, Excel, FileMaker Pro) &#8212; biologists will always use Excel, so is there some way we can help them use Excel better? Can we leverage their familiarity with these technologies? A software development background teaches you to look askance at &#8220;low-tech&#8221; solutions like these, but in this domain there is something to be said for them.</li>
  <li>Starting with the front end (UI) and working back helps user interaction &#8212; you can show a prototype of an interface and get better feedback earlier.</li>
  <li>Best ideas don&#8217;t always come from biologists: we are doing our job best when we are helping scientists understand what is possible with software, and helping them focus on their goals and how software could help achieve them.</li>
</ul>

<h3>Looking Forward</h3>

<p>We talked about possible content for a meeting or conference on biomedical and bioinformatics software development. People showed interest in the following.</p>

<ul>
<li>Documenting typical staffing models &#8212; help us all understand our options for organizing our work and getting it funded</li>
<li>Questionnaire &#8212; find out what people are doing, try to get a picture of the current state of affairs</li>
<li>Present experience reports: What working examples are out there?</li>
<li>People preferred a hands-on workshop to a conference: plenty of short talks that don&#8217;t give answers but raise questions, followed by open discussions.</li>
  <li>People would like to see defined milestones for the meeting.</li>
  <li>Tutorials on some subjects (e.g. project management, quality assurance and testing, good development practices) would be welcome</li>
  <li>Q&#038;A sessions &#8211; what have you done that can benefit others, what are you doing?</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.moseshohman.com/blog/2005/12/09/briite-2005-la-jolla/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
