gdc-client on Biowulf

Raw sequence data, stored as BAM files, makes up the bulk of the data stored at the NCI Genomic Data Commons (GDC). File sizes vary greatly: most BAM files in the GDC are between 50 MB and 40 GB, and some whole-genome BAM files reach 200-300 GB. The GDC Data Transfer Tool (gdc-client) provides an optimized method of transferring data to and from the GDC, and can resume interrupted transfers.

Important Notes

The vast majority of data available from the GDC is access controlled. Users must first register and obtain an authentication token before they can access controlled data.
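
As a rough sketch of how a token might be used, the function below passes a token file to gdc-client with the -t (--token-file) option. The token path and the function name download_controlled are hypothetical, chosen only for illustration; the token itself is assumed to have been downloaded from the GDC Data Portal and saved to a file.

```shell
# Hypothetical token location; point this at wherever you saved your token.
TOKEN_FILE="${TOKEN_FILE:-$HOME/gdc-user-token.txt}"

# download_controlled UUID
# Downloads one controlled-access file using the token in $TOKEN_FILE.
download_controlled() {
    if [ ! -r "$TOKEN_FILE" ]; then
        echo "token file not readable: $TOKEN_FILE" >&2
        return 1
    fi
    # Keep the token private: readable and writable by the owner only.
    chmod 600 "$TOKEN_FILE"
    gdc-client download -t "$TOKEN_FILE" "$1"
}
```

After loading the gdc-client module, this could be called as download_controlled 22a29915-6712-4f7a-8dba-985ae9a1f005. Tokens expire, so a failed authentication may simply mean a new token is needed.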

Interactive job
Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load gdc-client
[user@cn3144 ~]$ gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Create a batch input file. For example:

#!/bin/bash
module load gdc-client
gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005
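
When many files are needed, a batch file can take a manifest rather than a single UUID. The sketch below writes one such batch file in a single step using a heredoc. The file name gdc-manifest.sh, the manifest path, and the output directory are all placeholders, not names used elsewhere on this page. It relies on the gdc-client options -m (manifest), -d (download directory), and -n (number of client connections), and assumes Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given.

```shell
# Write a hypothetical batch file named gdc-manifest.sh.
cat > gdc-manifest.sh <<'EOF'
#!/bin/bash
module load gdc-client

# Placeholders: set these to your own manifest and output directory.
MANIFEST="$HOME/gdc_manifest.txt"
OUTDIR="/data/$USER/gdc"

mkdir -p "$OUTDIR"
# -m: manifest listing the files to fetch
# -d: directory to download into
# -n: client connections, matched to the CPUs Slurm allocated (1 if unset)
gdc-client download -m "$MANIFEST" -d "$OUTDIR" -n "${SLURM_CPUS_PER_TASK:-1}"
EOF
```

Because whole-genome BAM files can reach hundreds of gigabytes, it is worth confirming that the output directory has enough free space before submitting.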

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] <batch input file>

Swarm of Jobs
Create a swarmfile (e.g. gdc-client.swarm). For example:

gdc-client download cce8411d-dc96-5597-b198-fb47c5cf691c
gdc-client download 574b02d5-2de1-5aab-be8d-2c9d251dde9e
gdc-client download ad22c8a4-7767-5427-9271-5f4b506a124c
gdc-client download 4b8af859-b9cd-52b1-bc64-9bfa5d816a5d
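
For long file lists, a swarmfile like the one above could be generated rather than typed by hand. The sketch below assumes a manifest downloaded from the GDC Data Portal that is tab-separated, has a single header row, and carries the file UUID in its first column; check your manifest before relying on this. The function name make_swarm and the file names in the usage comment are illustrative only.

```shell
# make_swarm MANIFEST SWARMFILE
# Writes one "gdc-client download <UUID>" line per manifest entry.
make_swarm() {
    # Skip the header row (NR > 1), ignore rows with an empty first column,
    # and use the first tab-separated field as the file UUID.
    awk -F '\t' 'NR > 1 && $1 != "" { print "gdc-client download " $1 }' "$1" > "$2"
}

# Example (hypothetical file names):
#   make_swarm gdc_manifest.txt gdc-client.swarm
```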

Submit this job using the swarm command.

swarm -f gdc-client.swarm [-g #] [-t #] --module gdc-client --maxrunning 10

-g #                  Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t #                  Number of threads/CPUs required for each process (1 line in the swarm command file)
--module gdc-client   Loads the gdc-client module for each subjob in the swarm
--maxrunning 10       Allows at most 10 subjobs (downloads) to run simultaneously

gdc-client pulls data from the GDC Data Portal, which can be overwhelmed by high numbers of simultaneous downloads, causing individual swarm subjobs to fail. It is best to include --maxrunning 10 to prevent this overload.
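
Even with a limit on simultaneous downloads, an individual transfer may still fail. One possible way to cope, sketched here, is a small retry wrapper used inside a batch file. The function name retry and its arguments are hypothetical, and the sketch assumes that re-running the same gdc-client command lets the tool pick up an interrupted transfer, as the resumption feature described at the top of this page suggests.

```shell
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_TRIES attempts have been made.
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example use in a batch file, after "module load gdc-client":
#   retry 3 60 gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005
```

A pause between attempts gives the portal time to recover instead of adding to the load that caused the failure.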