High-Performance Computing at the NIH
GeneTorrent

GeneTorrent is a set of executables for accessing data in the Cancer Genomics Hub (CGHub), a secure repository for storing, cataloging, and accessing cancer genome sequences, alignments, and mutation information from the Cancer Genome Atlas (TCGA) consortium and related projects.

Please note: Access to the CGHub requires authorization by an appropriate data access committee. Upon approval, users will receive a special key file. The executables in GeneTorrent will not function without a key file. This key file should be kept in your /home directory.

gtdownload cannot download data through the proxy server that provides limited internet access from the compute nodes and the biowulf login node. Therefore, gtdownload must be run on helix.
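
A wrapper script can refuse to start a download on the wrong machine. The following is a minimal sketch, not part of GeneTorrent: the function name on_helix is hypothetical, and it assumes that the host name reported on helix begins with "helix".

```shell
# Hypothetical guard: succeed only when the host name starts with "helix".
# The host name to test can be passed as $1; by default `hostname` is used.
# Assumption: the helix host reports a name beginning with "helix".
on_helix() {
    host_name="${1:-$(hostname)}"
    case "$host_name" in
        helix*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

A script could then call `on_helix || exit 1` before invoking gtdownload.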

How to use

There are multiple versions of GeneTorrent available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail GeneTorrent

To select a module, type

module load GeneTorrent/[ver]

where [ver] is the version of choice.

The module load statement can be placed in your startup files for permanency.
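
As a sketch of one way to do this, the helper below appends the module load line to a startup file only if it is not already there. The function name add_module_line is hypothetical, and the default of ~/.bashrc assumes a bash login shell; replace [ver] with a version listed by module avail.

```shell
# Hypothetical helper: append "module load <name>" to a startup file,
# but only if that exact line is not already present.
#   $1  module name, e.g. GeneTorrent/[ver]
#   $2  startup file (assumed default: ~/.bashrc)
add_module_line() {
    module_name="$1"
    startup_file="${2:-$HOME/.bashrc}"
    module_line="module load ${module_name}"

    # Make sure the file exists so grep does not fail on a missing file.
    touch "$startup_file"

    # -q quiet, -x match the whole line, -F treat the pattern as a fixed string
    if ! grep -qxF -- "$module_line" "$startup_file"; then
        printf '%s\n' "$module_line" >> "$startup_file"
    fi
}
```

Calling `add_module_line "GeneTorrent/[ver]"` more than once leaves a single line in the file.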

cgquery

cgquery queries the database based on accession IDs or query filters. For example, let's query some publicly accessible CGHub data and print out a summary:

helix$ cgquery "study=TCGA_MUT_BENCHMARK_4&state=live&filetype=bam"

============================================================================
    Script Version                   : 2.1.9
    CGHub Server                     : https://cghub.ucsc.edu
    WebServices Interface Version    : 3.6
    REST Resource                    : /cghub/metadata/analysisDetail
    QueryString                      : study=TCGA_MUT_BENCHMARK_4&state=live&filetype=bam
    Output File                      : None
----------------------------------------------------------------------------
    Matching Objects                 : 34
============================================================================

    Analysis 1
        analysis_id                  : 02d8b3de-b043-4bfa-9130-adc18195313f
        state                        : live
        last_modified                : 2013-05-16T20:43:45Z
[...snip...]

    Summary of Matching Objects
        downloadable_file_count      : 68
        downloadable_file_size (TB)  : 4.13
        state_count
            live                     : 34

----------------------------------------------------------------------------
    All matching objects are in a downloadable state.
----------------------------------------------------------------------------

Create a manifest file that can be used to download some publicly accessible data:

helix$ cgquery -vv -o query2_result.xml \
          "analysis_id=f0eaa94b-f622-49b9-8eac-e4eac6762598"

For protected data, a private key has to be provided:

helix$ cgquery -c /home/[user]/cghub.key -i "analysis_id=bd263968-0d22-487b-b0a9-68cade255322"

The file /home/[user]/cghub.key is the key file for [user].
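
Because the executables will not function without a key file, it can save time to confirm that the key is in place before running a query. This is a minimal sketch; the function name check_key_file is hypothetical, and it only checks that the file is readable and not empty, not that CGHub will accept the key.

```shell
# Hypothetical pre-flight check for a key file.
# Returns 0 if the file is readable and non-empty, 1 otherwise.
# It does not verify that the key is valid.
check_key_file() {
    key_file="$1"
    if [ ! -r "$key_file" ]; then
        echo "key file not found or not readable: $key_file" >&2
        return 1
    fi
    if [ ! -s "$key_file" ]; then
        echo "key file is empty: $key_file" >&2
        return 1
    fi
    return 0
}
```

For example, `check_key_file /home/[user]/cghub.key && cgquery -c /home/[user]/cghub.key ...` runs the query only when the check passes.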

gtdownload

gtdownload copies files from CGHub to a local filesystem. For example, to download the public data set we queried above, use the public CGHub key:

helix$ gtdownload -c https://www.cghub.ucsc.edu/software/downloads/cghub_public.key \
        -vv -d query2_result.xml
[...snip...]
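
Before launching a manifest-driven download such as the one above, it can be useful to see which analyses the manifest covers. The sketch below assumes that the XML written by cgquery stores each ID as an <analysis_id>...</analysis_id> element, which should be checked against an actual manifest; the function name list_analysis_ids is hypothetical.

```shell
# Hypothetical helper: print the analysis IDs found in a manifest file.
# Assumption: each ID appears as <analysis_id>UUID</analysis_id>.
#   $1  path to the XML manifest, e.g. query2_result.xml
list_analysis_ids() {
    # grep -o prints only the matching elements, one per line;
    # sed then strips the opening and closing tags.
    grep -o '<analysis_id>[^<]*</analysis_id>' "$1" |
        sed -e 's/<analysis_id>//' -e 's/<\/analysis_id>//'
}
```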

Note that downloading this manifest will fetch 280GB of data. For a quicker example that should finish in about two minutes and uses a UUID directly instead of a manifest file, use:

helix$ gtdownload -d ebdb53ae-6386-4bc4-90b1-4f249ff9fcdf \
          -c https://www.cghub.ucsc.edu/software/downloads/cghub_public.key -vv -k 60
[...snip...]

helix$ tree -sh
.
├── [4.0K]  ebdb53ae-6386-4bc4-90b1-4f249ff9fcdf
│   ├── [8.3G]  C835.HCC1143_BL.4.bam
│   └── [5.2M]  C835.HCC1143_BL.4.bam.bai
└── [ 42K]  ebdb53ae-6386-4bc4-90b1-4f249ff9fcdf.gto

Again, for protected data, a private key has to be used:

helix$ gtdownload -d 5d97f808-2d44-4db3-9336-ede7680d1eaf -c /home/[user]/cghub.key -vv
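
When several data sets are needed, the same command can be repeated for each UUID. The following sketch only prints the commands so they can be reviewed before anything is downloaded. The function name print_download_commands and the idea of a plain-text UUID list (one UUID per line) are assumptions for illustration; the -d, -c, and -vv options are the ones used in the examples above.

```shell
# Hypothetical helper: print one gtdownload command per UUID.
#   $1  text file with one UUID per line
#       (blank lines and lines starting with # are skipped)
#   $2  path to the key file
# Nothing is downloaded; the commands are only written to stdout.
print_download_commands() {
    uuid_list="$1"
    key_file="$2"
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r uuid || [ -n "$uuid" ]; do
        case "$uuid" in
            ''|'#'*) continue ;;
        esac
        printf 'gtdownload -d %s -c %s -vv\n' "$uuid" "$key_file"
    done < "$uuid_list"
}
```

After checking the printed commands, they can be run on helix one at a time or saved to a script.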

Documentation