find output_dir/chrysalis -regex ".*allProbPaths.fasta" -exec cat {} \; > Trinity.fasta
Ideally, you will have access to a large-memory server with roughly 1G of RAM per 1M reads to be assembled. Inchworm in particular is memory hungry: it loads the entire k-mer composition of the reads into physical memory. In double-stranded mode it loads both the forward and the reverse-complemented k-mers, so even more RAM is used. Future versions of the Inchworm software will focus on reducing its memory requirements.
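As a rough illustration of the 1G per 1M reads rule of thumb, the sketch below counts the records in a FASTA file of reads and prints the corresponding estimate in whole gigabytes. The helper name estimate_ram_gb and the file name reads.fa are placeholders, not part of Trinity, and the figure does not account for the extra memory used in double-stranded mode.

```shell
#!/bin/sh
# Rough RAM estimate from the "about 1G of RAM per 1M reads" rule of thumb.
# estimate_ram_gb is a hypothetical helper, not part of Trinity.
estimate_ram_gb() {
    reads_fasta=$1
    # Count FASTA records (header lines start with '>').
    n_reads=$(grep -c '^>' "$reads_fasta")
    # Round up to whole gigabytes: ceil(n_reads / 1,000,000).
    echo $(( (n_reads + 999999) / 1000000 ))
}
```

For example, "estimate_ram_gb reads.fa" on a file holding 40 million reads would print 40.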
It depends on a number of factors including the number of reads to be assembled, the complexity of the transcript graphs, and the Trinity command-line settings that influence the complexity of the graphs. Inchworm and Chrysalis fairly reliably consume about 1 hour per million reads. Inchworm can take up to twice as long when non-strand-specific RNA-Seq data are used, since it loads both strands of the transcript data.
You can monitor the progress of the system in several ways: (1) status information is written to stdout; (2) while Inchworm is running, it logs status information to the output_dir/monitor.out file, which you can follow by running "tail -f output_dir/monitor.out"; and (3) you can run top to see which process is running, how long it has been running, and what computational resources it is consuming.
Occasionally, Butterfly may encounter a particularly complicated graph and seem to take a very long time to traverse it. You can spot this by running top and finding a Butterfly process that has been running for a long while. If you lack the patience to wait for it to complete, you can kill that process and rerun it with parameters that should further simplify the graph, for example by rerunning the failed Butterfly command with a higher --edge-thr value (a suboptimal solution, we know, and we are working to improve it).
To troubleshoot individual graphs using Butterfly, see Advanced Guide to Trinity.
The Inchworm and Chrysalis steps need to be run on a single server as a single process; Butterfly, however, can be run in parallel on a computing grid. If you choose to run Butterfly on your computing grid, do not use the Trinity.pl --run_butterfly parameter. When Chrysalis completes, it creates a file containing all the Butterfly commands that need to be executed. The filename should be output_dir/chrysalis/butterfly_commands.adj (be sure to use the file with the .adj extension, since it includes the other Butterfly parameters passed through the Trinity.pl wrapper). Each command line in this file can be run independently on a different node within your computing grid. How to get each of these commands to run on your grid will depend on your computing infrastructure. We have several ways to do this via LSF, but they're all home-grown solutions and not included here.
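Grid submission is site-specific, but the idea of running each line of the command file as an independent job can be tried on a single multi-core machine. The following is a minimal sketch, assuming GNU xargs (for the -d and -P options); run_commands_parallel is a hypothetical helper name and is not part of Trinity.

```shell
#!/bin/sh
# Run each line of a command file as its own job, several at a time.
# run_commands_parallel is a hypothetical helper; it assumes GNU xargs (-d, -P).
run_commands_parallel() {
    cmd_file=$1
    n_jobs=${2:-4}   # number of commands to run at once (default 4)
    # Skip blank lines, then hand each remaining line to "sh -c" as one command.
    grep -v '^[[:space:]]*$' "$cmd_file" |
        xargs -d '\n' -n 1 -P "$n_jobs" sh -c
}
```

A call might look like "run_commands_parallel output_dir/chrysalis/butterfly_commands.adj 8". If the commands in the file use relative paths, run it from the directory in which those paths resolve.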
Once all butterfly commands have been executed, you can retrieve all the resulting assembled transcripts by concatenating the individual result files together like so:
find output_dir/chrysalis -regex ".*allProbPaths.fasta" -exec cat {} \; > Trinity.fasta
Currently, the mappings of reads to transcripts are not reported. To obtain this information, we recommend realigning the reads to the assembled transcripts using Bowtie or BWA. We aim to automate this in the future.
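One way to do the realignment by hand is sketched below with Bowtie (version 1), assuming single-end reads in FASTA format. The helper name align_reads_to_transcripts, the index prefix, and the file names are placeholders chosen for illustration; paired-end or FASTQ input would need different Bowtie options.

```shell
#!/bin/sh
# Align reads back to the assembled transcripts with Bowtie (version 1).
# align_reads_to_transcripts is a hypothetical helper; file names are placeholders.
align_reads_to_transcripts() {
    transcripts=$1   # e.g. Trinity.fasta
    reads=$2         # single-end reads in FASTA format (an assumption)
    out_sam=$3       # where to write the alignments
    index_prefix="${transcripts}.bowtie"
    # Build the Bowtie index from the assembled transcripts.
    bowtie-build "$transcripts" "$index_prefix" || return 1
    # -f: reads are in FASTA format; -S: write alignments in SAM format.
    bowtie -f -S "$index_prefix" "$reads" > "$out_sam"
}
```

For example: "align_reads_to_transcripts Trinity.fasta reads.fa reads_vs_Trinity.sam".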
In nearly every case, this is because the stack grew beyond its maximum allowed size. Chrysalis can sometimes recurse fairly deeply, in which case the stack can grow substantially. Before running Trinity, be sure that your stacksize is set to unlimited. On Linux, under csh or tcsh (where limit and unlimit are shell built-ins), you can check your default settings like so:
bhaas@hyacinth$ limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    100000 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  131072
memorylocked 256 kbytes
maxproc      2102272
and you would set the stacksize (and other settings) to unlimited like so:
bhaas@hyacinth$ unlimit
and verify the new settings:
bhaas@hyacinth$ limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize unlimited
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  131072
memorylocked 256 kbytes
maxproc      2102272
On Solaris and perhaps Mac OS X (and in sh-style shells such as bash, which do not provide limit and unlimit), the syntax is different and might be:
ulimit -s unlimited
On Snow Leopard (Mac OS X 10.6), you cannot set it to unlimited for some reason (in older versions you could), so try to set it as high as the system allows.
An update to Chrysalis is in the works that will explore alternatives to the recursive processes that sometimes require altered stacksize settings.
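For sh-style shells, the "try unlimited, otherwise max it out" advice can be sketched as follows. This is only an illustration: raise_stack_limit is a hypothetical helper, and it assumes a shell whose ulimit built-in accepts -s and -H (bash and dash do). The limit applies to the current shell and to the programs started from it, so the function has to be called in the same shell session that later launches Trinity.

```shell
#!/bin/sh
# Raise the soft stack limit as far as allowed before launching a job.
# raise_stack_limit is a hypothetical helper for sh/bash-style shells.
raise_stack_limit() {
    # Try unlimited first; if that is refused, fall back to the hard limit.
    ulimit -s unlimited 2>/dev/null || ulimit -s "$(ulimit -H -s)" 2>/dev/null
    # Report the resulting soft limit (in kbytes, or "unlimited").
    ulimit -s
}
```

Calling "raise_stack_limit" prints the stack size now in effect, which you can compare with the value reported before the call.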
Occasionally (very rarely, perhaps one component per tens of thousands, if at all), Butterfly will encounter a complicated transcript graph and seem to take an eternity to process it. You will notice this by running top and seeing a java process that has been running for a very long time. For example, I'm running a dozen Butterfly commands on my large server (22 cores, 256 GB RAM), and I can see various Butterfly jobs running as java in the top view:
Tasks: 500 total, 7 running, 493 sleeping, 0 stopped, 0 zombie
top - 09:13:33 up 131 days, 21:07, 4 users, load average: 70.72, 53.70, 28.00
Tasks: 510 total, 9 running, 501 sleeping, 0 stopped, 0 zombie
Cpu(s): 89.1%us, 10.4%sy, 0.0%ni, 0.2%id, 0.0%wa, 0.1%hi, 0.2%si, 0.0%st
Mem: 264349428k total, 48345144k used, 216004284k free, 126640k buffers
Swap: 8385920k total, 314336k used, 8071584k free, 18855720k cached

  PID USER  PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+ COMMAND
 7775 bhaas 16  0 1373m 302m 8724 S 201.2  0.1 0:04.02 java
 7735 bhaas 17  0 1358m 329m 8776 S 171.1  0.1 0:04.47 java
 7310 bhaas 17  0 1300m 359m 8804 S 140.9  0.1 0:07.84 java
 8194 bhaas 17  0 1294m 165m 8680 S 125.8  0.1 0:01.88 java
 8313 bhaas 18  0 1356m  36m 8580 S  98.1  0.0 0:00.73 java
 8075 bhaas 17  0 1290m  53m 8668 S  93.1  0.0 0:01.18 java
10241 bhaas 18  0 1376m 604m 8820 S  88.0  0.2 4:31.80 java
32424 bhaas 18  0 1306m 474m 8816 S  88.0  0.2 0:58.53 java
 8143 bhaas 17  0 1292m  48m 8664 S  85.5  0.0 0:01.23 java
 8258 bhaas 17  0 1291m  48m 8656 S  80.5  0.0 0:01.07 java
 1305 bhaas 17  0 1377m 509m 8820 S  78.0  0.2 0:56.11 java
10247 bhaas 18  0 1356m 1.0g 8812 S  78.0  0.4 4:26.23 java
...
A way to see exactly what jobs are running is to execute the following:
bhaas@hyacinth$ ps auxww | grep Butterfly
bhaas 4588 50.3 0.1 1355708 435476 pts/4 Sl 09:17 0:38 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 9814096 -L 300 -F 300 -C chrysalis/RawComps.0/comp374 --edge-thr=0.16
bhaas 5920 51.3 0.1 1353604 409604 pts/4 Sl 09:18 0:33 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10114793 -L 300 -F 300 -C chrysalis/RawComps.0/comp412 --edge-thr=0.16
bhaas 7747 53.0 0.2 1325344 530752 pts/4 Sl 09:13 3:01 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 11032490 -L 300 -F 300 -C chrysalis/RawComps.0/comp127 --edge-thr=0.16
bhaas 10241 56.5 0.2 1409492 625972 pts/4 Sl 09:06 7:18 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10630881 -L 300 -F 300 -C chrysalis/RawComps.0/comp2 --edge-thr=0.16
bhaas 10247 51.9 0.4 1389204 1077640 pts/4 Sl 09:06 6:42 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10702374 -L 300 -F 300 -C chrysalis/RawComps.0/comp0 --edge-thr=0.16
bhaas 10249 51.8 0.4 1394704 1082764 pts/4 Sl 09:06 6:41 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10702374 -L 300 -F 300 -C chrysalis/RawComps.0/comp1 --edge-thr=0.16
Most of the butterfly commands have been running for only a short period of time (seconds), but there are a couple that have been running for several minutes. Most commands will take less than a few minutes to run, and some can take up to an hour. If you see a butterfly command (java) that has been running for many hours, you can consider killing it and trying it again later with altered butterfly parameters. There are a couple of ways to kill the process.
From the command line, you can kill it like so:
kill $pid
where $pid is the process ID in the first column of the top output or second column of the ps output.
From within top, you can kill it by typing k, Enter, $pid, Enter (this is how it works on Linux; your system may vary).
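Instead of scanning top by eye, long-running Butterfly jobs can be picked out of a process listing with a small filter. The sketch below is an illustration only: filter_long_running is a hypothetical helper that reads lines of the form "pid elapsed_seconds command" on stdin, and the ps invocation in the comment assumes a Linux ps that supports the etimes (elapsed seconds) column, which older or non-Linux versions may lack.

```shell
#!/bin/sh
# Print the PIDs of Butterfly processes running longer than a threshold.
# Input: lines of "pid elapsed_seconds command..." on stdin.
# filter_long_running is a hypothetical helper, not part of Trinity.
filter_long_running() {
    min_seconds=$1
    # Keep lines whose second field exceeds the threshold and whose
    # command mentions Butterfly.jar; the ps header line is skipped
    # because its second field evaluates to 0.
    awk -v min="$min_seconds" '$2 + 0 > min && /Butterfly\.jar/ { print $1 }'
}
# Example (assumes a ps that supports the "etimes" column):
#   ps -eo pid,etimes,args | filter_long_running 7200
```

The printed PIDs can then be reviewed and, if you decide to give up on those jobs, passed to kill.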
Once a Butterfly command has finished (or you've killed it to retry it later), the next butterfly command in the queue will take its place.
If all Butterfly commands complete successfully, the Trinity.pl wrapper script will report success and concatenate the individual Butterfly assembly outputs into a single file (Trinity.fasta). If any commands did not succeed, the failed (or killed) commands will be reported and written to a file so that you can adjust the parameters (for example, raise the --edge-thr value) and rerun them, or troubleshoot (see Advanced Guide to Trinity).
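Adjusting the threshold in a file of failed commands can be done with a simple text substitution. The sketch below assumes the commands carry the option in the --edge-thr=VALUE form seen in the ps listing; raise_edge_thr is a hypothetical helper, the file names in the usage line are placeholders, and the value 0.25 is purely illustrative rather than a recommended setting.

```shell
#!/bin/sh
# Rewrite Butterfly command lines with a new --edge-thr value.
# Reads commands on stdin and writes the adjusted commands to stdout.
# raise_edge_thr is a hypothetical helper, not part of Trinity.
raise_edge_thr() {
    new_value=$1
    # Replace the numeric value after --edge-thr= and leave the rest untouched.
    sed -E "s/--edge-thr=[0-9.]+/--edge-thr=${new_value}/"
}
```

For example: "raise_edge_thr 0.25 < failed_commands.txt > retry_commands.txt", after which the commands in retry_commands.txt can be rerun.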