SRA-Toolkit

The NCBI SRA Toolkit (and its underlying SDK) provides loading and dumping tools, with their respective libraries, for building new runs and accessing existing ones.

Multiple versions of the SRA-Toolkit are available. The easiest way to select a version is with environment modules. To see the available modules, type

module avail sratoolkit

To select a module, type

module load sratoolkit[/ver]

where [ver] is the optional version of choice. This will set up your environment, prepending the chosen toolkit's bin directory to your $PATH.
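
For example, to load a specific version (the version number here is illustrative; pick one listed by module avail):

module load sratoolkit/2.5.7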

Configuring SRA-Toolkit On Helix/Biowulf

By default, the SRA Toolkit installed on Biowulf uses the central Biowulf configuration file, which is configured to NOT maintain a local cache of SRA data. After discussion with NCBI SRA developers, it was decided that this is the most appropriate setup for most users on Biowulf. Programs such as hisat can automatically download SRA data as needed.
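
For instance, hisat2 can align reads streamed directly from SRA by accession when built with NGS/SRA support; a minimal sketch, where the index path is a placeholder:

hisat2 -p 4 -x /path/to/genome_index --sra-acc SRR390728 -S SRR390728.sam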

In some cases, users may want to download SRA data and retain a copy. To download using NCBI's 'prefetch' tool, you will need to set up your own configuration file for the NCBI SRA toolkit. Use the command vdb-config to set up a directory for downloading. In the following example, the vdb-config utility is used to set up /data/$USER/sra-data as the local repository for downloading SRA data. Remember that /home/$USER is limited to a quota of 8 GB, so it is best to direct your downloaded SRA data to /data/$USER.
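
Alternatively, newer toolkit versions let you set the repository root non-interactively with vdb-config's --set option; a sketch (verify the option with vdb-config --help for your loaded version):

vdb-config --set "/repository/user/main/public/root=/data/$USER/sra-data"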

Sample session; user input appears after the '>' prompts.

[susanc@biowulf ~]$ vdb-config --interactive --interactive-mode textual
     vdb-config interactive

  data source

   NCBI SRA: enabled (recommended) (1)

   site    : enabled (recommended) (2)


  local workspaces

  Open Access Data
not cached (not recommended) (3)
location: '' (4)


To cancel and exit      : Press <Enter>
To update and continue  : Enter corresponding symbol and Press <Enter>

Your choice > 4

Path to Public Repository:

Enter the new path and Press <Enter>
Press <Enter> to accept the path
> /data/susanc/sra-data

Changing root path to 'Public' repository to '/data/susanc/sra-data'

     vdb-config interactive

  data source

   NCBI SRA: enabled (recommended) (1)

   site    : enabled (recommended) (2)


  local workspaces

  Open Access Data
not cached (not recommended) (3)
location: '/data/susanc/sra-data' (4)


To cancel and exit      : Press <Enter>
To save changes and exit: Enter Y and Press <Enter>
To update and continue  : Enter corresponding symbol and Press <Enter>

Your choice > 3

Enabling user repository caching...

     vdb-config interactive

  data source

   NCBI SRA: enabled (recommended) (1)

   site    : enabled (recommended) (2)


  local workspaces

  Open Access Data
cached (recommended) (3)
location: '/data/susanc/sra-data' (4)

To cancel and exit      : Press <Enter>
To save changes and exit: Enter Y and Press <Enter>
To update and continue  : Enter corresponding symbol and Press <Enter>

Your choice > y
Saving...
Exiting...

[biowulf]$ 

Dealing with encryption keys

Once a repository key (.ngc file) has been obtained, you can import it using the vdb-config command:

vdb-config --import /path/to/keyfile.ngc

For more information about encrypted data, please see the Protected Data Usage Guide at NCBI.
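
With toolkit versions 2.10 and later, the key can also be supplied per command instead of being imported, and encrypted files can be decrypted with the toolkit's vdb-decrypt tool. A sketch, in which the paths and accession are placeholders:

prefetch --ngc /path/to/keyfile.ngc SRR1234567
vdb-decrypt /path/to/encrypted_file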

Downloading data from SRA

If you have set up a personal repository as described above, you can download data using the prefetch tool. For example:

[USER@biowulf ~]$  module load sratoolkit

[USER@biowulf ~]$ prefetch SRR390728
Maximum file size download limit is 20,971,520KB

2016-06-13T13:32:12 prefetch.2.5.7: 1) Downloading 'SRR390728'...
2016-06-13T13:32:12 prefetch.2.5.7:  Downloading via http...
2016-06-13T13:32:17 prefetch.2.5.7: 1) 'SRR390728' was downloaded successfully
2016-06-13T13:32:18 prefetch.2.5.7: 'SRR390728' has 0 unresolved dependencies
2016-06-13T13:32:18 prefetch.2.5.7: 'SRR390728' has remote vdbcache
2016-06-13T13:32:18 prefetch.2.5.7:  Downloading via http...

The SRA data for accession SRR390728 will be downloaded to the repository area you have set up. For example:

[USER@biowulf sra-data]$ tree /data/$USER/sra-data/
/data/$USER/sra-data/
+-- files
+-- nannot
+-- refseq
+-- sra
|   +-- SRR390728.sra
|   +-- SRR390728.sra.vdbcache
+-- wgs

5 directories, 2 files
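
To fetch many runs into the repository, prefetch can be driven from a list; a minimal sketch, where accessions.txt is a hypothetical file containing one accession per line:

while read acc; do
    prefetch "$acc"
done < accessions.txt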

You can also download FASTQ files from SRA using the fastq-dump tool, which writes the FASTQ file to your current working directory by default.

For example:

[USER@biowulf]$ cd /data/$USER/mydir

[USER@biowulf]$  module load sratoolkit

[USER@biowulf]$ fastq-dump SRR2048331
Read 16600251 spots for SRR2048331
Written 16600251 spots for SRR2048331
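
For paired-end runs you will usually want one FASTQ file per mate; fastq-dump's --split-files option produces these. Newer toolkit versions also include the multithreaded fasterq-dump, which splits mates by default. A sketch using the same accession:

fastq-dump --split-files SRR2048331
fasterq-dump --threads 8 -O . SRR2048331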

Submitting a single batch job

1. Create a script file similar to the one below.

#!/bin/bash 

cd /data/$USER/mydir
module load sratoolkit
fastq-dump some.sra
sam-dump some.sra > my_sam.sam
....
....

2. Submit the script on Biowulf:

[biowulf]$ sbatch myscript
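
Standard Slurm options can be added to request resources; for example, to allocate 20 GB of local scratch (see the note on local disk below) and 4 GB of memory:

[biowulf]$ sbatch --gres=lscratch:20 --mem=4g myscript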

Using Swarm

NOTE: The SRA Toolkit executables use random access to read input files. Because of this, users with data located on GPFS filesystems will see significant slowdowns in their jobs. For SRA data (including dbGaP data) it is best to first copy the input files to a local /lscratch/$SLURM_JOBID directory, work on the data in that directory, and copy the results back at the end of the job, as in the example below. See the section on using local disk in the Biowulf User Guide.

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g., /data/$USER/cmdfile). Here is a sample file that downloads SRA data using fastq-dump:

# run fastq-dump to download the data, then process further, then copy results back to /data

fastq-dump --aligned --table PRIMARY_ALIGNMENT -O /lscratch/$SLURM_JOBID SRR1234 ; some_command ; \
  cp -R /lscratch/$SLURM_JOBID/some_files  /data/$USER/myoutputdir/
fastq-dump --aligned --table PRIMARY_ALIGNMENT -O /lscratch/$SLURM_JOBID SRR3456 ; some_command; \
  cp -R /lscratch/$SLURM_JOBID/some_files  /data/$USER/myoutputdir/

If you have previously downloaded SRA data into your own directory, you can copy those files to local scratch on the node, process them there, then copy the output back to your /data area. Sample swarm command file:

# copy files from /data, run fastq-dump and other commands, then copy output back to /data
cp /data/user/path/to/SRR1234.sra /lscratch/$SLURM_JOBID; \
  fastq-dump --aligned --table PRIMARY_ALIGNMENT -O /lscratch/$SLURM_JOBID /lscratch/$SLURM_JOBID/SRR1234.sra ; \
  some_other_command ; \
  cp -R /lscratch/$SLURM_JOBID/some_files /data/$USER/myoutputdir/
cp /data/user/path/to/SRR56789.sra /lscratch/$SLURM_JOBID; \
  fastq-dump --aligned --table PRIMARY_ALIGNMENT -O /lscratch/$SLURM_JOBID /lscratch/$SLURM_JOBID/SRR56789.sra ; \
  some_other_command ; \
  cp -R /lscratch/$SLURM_JOBID/some_files /data/$USER/myoutputdir/

[....]

The --gres=lscratch:N option must be included in the swarm command to allocate local disk on the node. For example, to allocate 100 GB of scratch space and 4 GB of memory:

$ swarm -f cmdfile --module sratoolkit --gres=lscratch:100 -g 4 

For more information about running swarm, see the swarm documentation (swarm.html).

Running an interactive job

Allocate an interactive session and run the interactive job there.

[biowulf]$ sinteractive 
salloc.exe: Granted job allocation 789523
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn0135 are ready for job
[cn0135]$ cd /data/$USER/dir
[cn0135]$ module load sratoolkit
[cn0135]$ fastq-dump some.sra  
[cn0135]$ exit
salloc.exe: Job allocation 789523 has been revoked.
[biowulf]$

NOTE: If you get an error that looks like this:
2015-07-20T16:21:15 fastq-dump.2.5.2 err: item not found while constructing within virtual database module - the path 'SRR390729' cannot be opened as database or table
please contact the Biowulf staff.

Documentation