High-Performance Computing at the NIH

Raw sequence data, stored as BAM files, makes up the bulk of the data held at the NCI Genomic Data Commons (GDC). File sizes vary greatly: most BAM files in the GDC are between 50 MB and 40 GB, and some whole genome BAM files reach 200-300 GB. The GDC Data Transfer Tool provides an optimized method of transferring data to and from the GDC, and enables resumption of interrupted transfers.

The vast majority of data available from the GDC is access-controlled. Users must first register and obtain an authentication token before they can download controlled data.
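For controlled files, the token has to be handed to the client at download time. The following is a minimal sketch, not an official recipe. It assumes the token was saved to a file (the path $HOME/gdc-token.txt is hypothetical) and that the loaded gdc-client version accepts the -t token-file option; confirm with gdc-client download --help. The helper names check_token and download_controlled are made up for illustration, and the permission check is only a precaution against leaving the token readable by other users.

```shell
#!/bin/bash
# Hypothetical token location; adjust to wherever the token was saved.
TOKEN_FILE="${TOKEN_FILE:-$HOME/gdc-token.txt}"

# Return 0 if the token file exists, is non-empty, and has mode 600 or 400.
# Uses GNU stat (-c), as found on Linux systems.
check_token() {
    local file="$1"
    [ -s "$file" ] || { echo "token file missing or empty: $file" >&2; return 1; }
    local mode
    mode=$(stat -c '%a' "$file") || return 1
    case "$mode" in
        600|400) return 0 ;;
        *) echo "token file mode is $mode; run: chmod 600 $file" >&2; return 1 ;;
    esac
}

# Download one file by UUID, passing the token file to gdc-client.
# Assumes gdc-client is on PATH (for example after 'module load gdc-client').
download_controlled() {
    local uuid="$1"
    check_token "$TOKEN_FILE" || return 1
    gdc-client download -t "$TOKEN_FILE" "$uuid"
}
```

With these functions defined, download_controlled followed by a file UUID would run the download with the token attached, and would refuse to start if the token file is missing or too permissive.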

gdc-client is made available through environment modules. To see the modules available, type

module avail gdc-client

To select a module, type

module load gdc-client/[ver]

where [ver] is the version of choice.

Environment variables set:

On Helix

Sample session:

module load gdc-client
gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005

Batch job on Biowulf

Create a batch input file (e.g. download.sh). For example:

#!/bin/bash
# sbatch requires the script to begin with an interpreter line
module load gdc-client
gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=1 download.sh
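Because large transfers can be interrupted, a batch script may want to re-run a failed download so the client can pick up where it left off. The sketch below is one way to do that, under stated assumptions: that gdc-client exits with a non-zero status when a transfer fails, and that the UUIDs to fetch are listed one per line in a text file. The function names retry and download_all, the variables MAX_TRIES and RETRY_DELAY, and the file name uuids.txt are all hypothetical.

```shell
#!/bin/bash
#SBATCH --cpus-per-task=1

# retry CMD [ARGS...]: run a command until it succeeds, at most MAX_TRIES
# times (default 3), sleeping RETRY_DELAY seconds (default 60) between tries.
retry() {
    local tries="${MAX_TRIES:-3}" delay="${RETRY_DELAY:-60}" attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$tries" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# download_all FILE: download every UUID listed in FILE (one per line),
# skipping blank lines. Returns 1 if any download ultimately failed.
# The list is read on file descriptor 3 so the client keeps its own stdin.
download_all() {
    local list="$1" uuid failed=0
    while read -r uuid <&3 || [ -n "$uuid" ]; do
        [ -n "$uuid" ] || continue
        retry gdc-client download "$uuid" || failed=1
    done 3< "$list"
    return "$failed"
}
```

To use this as a batch job, add module load gdc-client and then download_all uuids.txt as the last two lines of the script, and submit it with sbatch as above.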
Interactive job on Biowulf

Once an interactive session has been started, the steps are identical to those on Helix.