Biowulf High Performance Computing at the NIH
icgc-get on Helix & Biowulf

icgc-get provides a unified interface to the many sources of data from the International Cancer Genome Consortium (ICGC).

Important Notes

IMPORTANT: icgc-get on Biowulf cannot be used to download datasets that reside on either AWS or Collaboratory.

icgc-get must be configured before use. See the example interactive session below for how to do this.

Interactive use
Interactive data transfers are best run on Helix, which has a direct connection to the internet and does not route traffic through one of the HPC proxy servers.
Sample session (user input in bold):

[user@helix ~]$ module load icgc-get
[+] Loading aws  current  on cn3096 
[+] Loading gdc-client 1.4.0 on cn3096 
[+] Loading java 1.8.0_181  ... 
[+] Loading score-client, version 1.6.1... 
[+] Loading ega, version 2.2.2... 
[+] Loading icgc-get, version 0.6.1...

Before it can be used for the first time, icgc-get must be configured. A baseline configuration is provided as a starting point in /usr/local/apps/icgc-get/config.yaml; it presets the paths to the other download clients that icgc-get drives. When you run icgc-get configure, pressing Enter at a prompt accepts the default/pre-configured value shown in brackets.

[user@helix ~]$ mkdir ~/.icgc-get && cp /usr/local/apps/icgc-get/config.yaml ~/.icgc-get/
[user@helix ~]$ icgc-get configure
You will receive a series of prompts for all relevant configuration values and access parameters
Existing configuration values are listed in square brackets.  To keep these values, press Enter. 
To input multiple values for a prompt, separate each value with a space.

Enter a directory for downloaded files to be saved to. icgc-get will attempt to create it if it does not exist.
output []: 

Enter a location for the process logs to be stored.  Must be in an existing directory.  Optional.
logfile [/home/user/.icgc-get/icgc_get.log]: 

Enter which repositories you want to download from.
Valid repositories are: collaboratory aws-virginia ega gdc pdc
repos [collaboratory ega gdc pdc]: 

The order in which you list the repositories here determines their precedence: when a file is available from more than one repository, icgc-get downloads it from the earliest one listed.

Note that aws-virginia cannot be used outside of AWS EC2 instances, so it will not work on the HPC systems.

Enter true or false if you wish to use a docker container to download and run all download clients
docker []: false

Enter the path to your local ICGC storage client installation
ICGC path [/usr/local/apps/score-client/current/bin/score-client]: 

Enter a valid ICGC access token
ICGC token []: your_icgc_download_token

Enter the path to your local EGA download client jar file
EGA path [/usr/local/apps/ega/current/EgaDemoClient.jar]: 

Enter your EGA username
EGA username []: username

Enter your EGA password
EGA password []: password

Enter the path to your local GDC download client installation
GDC path [/usr/local/apps/gdc-client/current/bin/gdc-client]: 

Enter a valid GDC access token
GDC token []: gdc_token

Enter the path to your local AWS-cli installation to access the PDC repository
AWS path [/usr/local/apps/aws/current/aws]: 

Enter your PDC s3 key
PDC key []: pdc_key

Enter your PDC s3 secret key
PDC secret key: pdc_secret
Configuration file saved to /home/user/.icgc-get/config.yaml
[user@helix ~]$ cd /data/$USER
[user@helix /data/user]$ icgc-get download file-ids

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. icgc-get.sh). For example:

#!/bin/bash
set -e
module load icgc-get

icgc-get download FI378424

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] icgc-get.sh
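The batch script above can be written out directly from the shell; the filename icgc-get.sh is illustrative, and the file ID is the example ID used on this page:

```shell
# Write the batch script shown above to a file.
# FI378424 is the example file ID from this page; substitute your own.
cat > icgc-get.sh <<'EOF'
#!/bin/bash
set -e
module load icgc-get

icgc-get download FI378424
EOF
```

The resulting file is what you pass to sbatch.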
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. icgc-get.swarm). For example:

icgc-get download FI378424
icgc-get download FI99996
icgc-get download FI99990
icgc-get download FI250134
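For longer lists of file IDs, a swarmfile like the one above can be generated with a short shell loop (the IDs below are the example IDs from this page; substitute your own):

```shell
# Emit one "icgc-get download" command per file ID into the swarmfile.
for id in FI378424 FI99996 FI99990 FI250134; do
    echo "icgc-get download $id"
done > icgc-get.swarm
```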

Submit this job using the swarm command.

swarm -f icgc-get.swarm [-g #] [-t #] --module icgc-get
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module icgc-get Loads the icgc-get module for each subjob in the swarm