What computing resources are required?

Ideally, you will have access to a large-memory server with roughly 1 GB of RAM per million reads to be assembled; for example, assembling 50 million reads would call for ~50 GB of RAM. Inchworm, in particular, is memory hungry: it loads the entire k-mer composition of the reads into physical memory. In double-stranded mode, it loads both the forward and reverse-complemented k-mers, so even more RAM is used. Future versions of the Inchworm software will focus on reducing its memory requirements.

How long should this take?

It depends on a number of factors including the number of reads to be assembled, the complexity of the transcript graphs, and the Trinity command-line settings that influence the complexity of the graphs. Inchworm and Chrysalis fairly reliably consume about 1 hour per million reads. Inchworm can take up to twice as long when non-strand-specific RNA-Seq data are used, since it loads both strands of the transcript data.

You can monitor the progress of the system in several ways: (1) status information is written to stdout; (2) while Inchworm is running, it logs status information to the output_dir/monitor.out file, which you can follow by running "tail -f output_dir/monitor.out"; and (3) you can run top to see which process is running, how long it has been running, and what computational resources are being consumed.

Occasionally, Butterfly may encounter a particularly complicated graph and seem to take a very long time to traverse it. You can discover this by running top and finding a Butterfly process that has been running for a long while. If you lack the patience to wait for it to complete, you can kill that process and rerun the failed Butterfly command with parameters that further simplify the graph, such as a higher --edge-thr value (a suboptimal solution, we know, and one we are working to improve).

To troubleshoot individual graphs using Butterfly, see the Advanced Guide to Trinity.

How can I run this in parallel on a computing grid?

The Inchworm and Chrysalis steps need to be run on a single server as a single process; Butterfly, however, can be run in parallel on a computing grid. If you choose to run Butterfly on your computing grid, do not use the Trinity.pl --run_butterfly parameter. When Chrysalis completes, it creates a file containing all of the Butterfly commands that need to be executed. The filename should be output_dir/chrysalis/butterfly_commands.adj (be sure to use the file with the .adj extension, since it includes the other Butterfly parameters passed through the Trinity.pl wrapper). Each command line in this file can be run independently on a different node of your computing grid. How to get each of these commands to run on your grid will depend on your computing infrastructure. We have several ways to do this via LSF, but they are all home-grown solutions and not included here.
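One generic approach, sketched below, is to split the command file into fixed-size batches and submit each batch as a single grid job. The batch size and the submission command are assumptions to adapt to your scheduler; a small stand-in command file is fabricated here so the sketch is self-contained, whereas in practice you would split the real output_dir/chrysalis/butterfly_commands.adj.

```shell
# Fabricate a small stand-in command file (250 one-line "commands").
mkdir -p bfly_demo
seq 1 250 | sed 's/^/echo butterfly_command_/' > bfly_demo/butterfly_commands.adj

# Split into batches of 100 commands each (batch.aa, batch.ab, ...).
split -l 100 bfly_demo/butterfly_commands.adj bfly_demo/batch.

# Submit one job per batch; each job runs its commands sequentially.
for b in bfly_demo/batch.*; do
    # Replace 'echo' with your scheduler's submit command (e.g. bsub, qsub).
    echo "would submit: sh $b"
done
```

Each batch file is an ordinary shell script of independent Butterfly commands, so any scheduler that can run "sh batchfile" on a node will work.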

Once all butterfly commands have been executed, you can retrieve all the resulting assembled transcripts by concatenating the individual result files together like so:

find output_dir/chrysalis -regex ".*allProbPaths.fasta" -exec cat {} \; > Trinity.fasta

How do I identify the specific reads that were incorporated into the transcript assemblies?

Currently, the mappings of reads to transcripts are not reported. To obtain this information, we recommend realigning the reads to the assembled transcripts using Bowtie or BWA. We aim to automate this in the future.
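For example, with Bowtie (version 1), the realignment might look like the following sketch. The file names are illustrative; substitute your own assembly and reads. The commands are written to a helper script so they can be reviewed and adapted before running.

```shell
# Write the two-step realignment (index, then align) to a script.
cat > realign_reads.sh <<'EOF'
bowtie-build Trinity.fasta trinity_index                   # index the assembly
bowtie -S trinity_index reads.fq > reads_vs_Trinity.sam    # align, SAM output
EOF

# Syntax-check only; run 'sh realign_reads.sh' once the real files exist.
sh -n realign_reads.sh
```

The resulting SAM file records which reads align to which assembled transcripts.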

Chrysalis died unexpectedly. What do I do?

In nearly every case, this is because the stacksize memory grew beyond its maximal setting. Chrysalis can sometimes recurse fairly deeply, in which case the stack can grow substantially. Before running Trinity, be sure that your stacksize is set to unlimited. On Linux, under csh/tcsh, you can check your default settings like so:

bhaas@hyacinth$ limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    100000 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  131072
memorylocked 256 kbytes
maxproc      2102272

and you would set the stacksize (and other settings) to unlimited like so:

bhaas@hyacinth$ unlimit

and verify the new settings:

bhaas@hyacinth$ limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize unlimited
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  131072
memorylocked 256 kbytes
maxproc      2102272

If your shell is bash or sh rather than csh/tcsh (as on many Solaris and Mac OS X setups), the syntax is different:

ulimit -s unlimited

On Snow Leopard (Mac OS X 10.6), the stacksize cannot be set to unlimited (it could be on older releases), so set it as high as the system allows.

An update to Chrysalis is in the works that will explore alternatives to the recursive processes that sometimes require altered stacksize settings.