24/06: R tools for scons

The motivation for using scons to automate analysis workflows is to formally specify the dependencies between the steps of an analysis. This is the idea behind "reproducible research" (a term clearly coined by non-biologists; I prefer "reproducible analysis"). You run one command and it turns your raw data into figures that are nearly ready for publication, which removes a lot of potential sources of error.

I've used Sweave for a couple of projects. It's a system that lets you embed chunks of R code in a LaTeX document; the calculations and figures generated by the embedded code are placed directly into the resulting PDF file. It's a powerful idea, but it doesn't lend itself well to multi-stage analysis workflows, where the output of one step flows into the next. Change one thing and you have to rerun the entire script. It makes a lot more sense to design each step as a modular component and formally specify the dependencies between the steps.
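
To make that concrete, here is a minimal sketch in plain Python (not scons itself) of what an explicit dependency graph buys you. The file names and the `stale_targets` helper are hypothetical, invented for illustration; scons does this bookkeeping for you, and far more carefully.

```python
# Toy dependency graph: each target maps to the files it is built from.
# All file names here are hypothetical.
deps = {
    'unit_stats.tbl': ['raw_data.csv', 'unit_analysis.R'],
    'summary.tbl': ['unit_stats.tbl', 'summarize.R'],
    'figures.pdf': ['summary.tbl', 'plot.R'],
    'qc_report.tbl': ['raw_data.csv', 'qc.R'],
}

def stale_targets(changed, deps):
    """Return the set of targets that must be rebuilt after `changed` is edited."""
    stale = set()
    frontier = {changed}
    while frontier:
        # Any target built from something in the frontier is out of date.
        hit = {t for t, srcs in deps.items()
               if t not in stale and frontier.intersection(srcs)}
        stale |= hit
        frontier = hit
    return stale

print(sorted(stale_targets('plot.R', deps)))
print(sorted(stale_targets('raw_data.csv', deps)))
```

Editing the plotting script only invalidates the figures, while touching the raw data invalidates everything downstream of it. A single monolithic script can't make that distinction.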

In the previous post, I described how to use a custom builder to add Python scripts to a scons workflow. You can do the same thing for R scripts and Sweave documents. With a little regular expression kung-fu you can even get scons to recognize which R scripts and data sources each script imports.


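import os, re, itertools

# Patterns for the R calls that pull in other files.
source_re = re.compile(r'^source\([\'\"](\S+)[\'\"]\)', re.M)
load_re = re.compile(r'^load\([\'\"](\S+?)[\'\"]\)', re.M)
# Matches read.table(), read.csv(), etc. anywhere on a line.
table_re = re.compile(r'read\.\S+\([\'\"](\S+?)[\'\"]', re.M)

def fix_rel(f):
    # A leading '#' makes scons treat the path as relative to the top-level SConstruct.
    return f if f.startswith('/') else ('#' + f)

def rfile_scan(node, env, path):
    # get_text_contents() returns a str; get_contents() returns bytes under Python 3.
    txt = node.get_text_contents()
    deps = itertools.chain(source_re.findall(txt),
                           load_re.findall(txt),
                           table_re.findall(txt))
    # Recent versions of scons expect File nodes rather than plain strings.
    return env.File([fix_rel(f) for f in deps])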

rbuild = Builder(action='R -q --vanilla $SCRIPTOPTS < $SOURCE')

sweavebuild = Builder(action='R CMD Sweave $SOURCE',
                      suffix='.tex',
                      src_suffix='.Rnw')

rscan = Scanner(function=rfile_scan,
                skeys=['.R', '.Rnw'])
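
To see what the scanner picks up, here is a small standalone sketch that runs the three patterns over an R script. The patterns and `fix_rel` are restated so the block runs on its own outside scons; the R script and its file names are hypothetical, made up for this example.

```python
import itertools
import re

# The scanner's patterns, restated so this block is self-contained.
source_re = re.compile(r'^source\([\'\"](\S+)[\'\"]\)', re.M)
load_re = re.compile(r'^load\([\'\"](\S+?)[\'\"]\)', re.M)
# The dot is escaped so that only read.* calls (read.table, read.csv, ...) match.
table_re = re.compile(r'read\.\S+\([\'\"](\S+?)[\'\"]', re.M)

def fix_rel(f):
    # Absolute paths pass through; everything else gets the scons '#' prefix.
    return f if f.startswith('/') else ('#' + f)

# Hypothetical R script, invented for this example.
r_script = '''\
source("helpers.R")
load('/data/shared/units.RData')
spikes <- read.table("spikes.tbl", header=TRUE)
# source("disabled.R")
'''

found = [fix_rel(f) for f in itertools.chain(source_re.findall(r_script),
                                             load_re.findall(r_script),
                                             table_re.findall(r_script))]
print(found)
```

The commented-out `source()` call is skipped because `source_re` and `load_re` are anchored to the start of a line. `table_re` is not anchored, since `read.table()` usually appears on the right-hand side of an assignment, so a commented-out `read.table()` call would still be reported as a dependency.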


You still have to manually specify what each script outputs (except for the Sweave builder, which knows the output will be a .tex file), for instance:


env = Environment()
env.Append(BUILDERS = {'RBuild' : rbuild,
                       'SWeave' : sweavebuild})
env.Append(SCANNERS = rscan)

unit_tbl = env.RBuild('unit_stats.tbl','unit_analysis.R')
