nda-tools: tools for handling data from the National Institute of Mental Health Data Archives (NDA)"

nda-tools: tools for handling data from the National Institute of Mental Health Data Archives (NDA)"

Quick Links

In order to submit data to the National Institute of Mental Health Data Archives (NDA), users must validate their data to ensure it complies with the required format. This is done using the NDA validation tool, vtcmd. Additionally, users can package and download data from NDA as well, using the downloadcmd tool.

Documentation

Important Notes

Module Name: nda-tools (see the modules page for more information)
Implemented as a Singularity container
Unusual environment variables set
- NDA-TOOLS_HOME installation directory
- NDA-TOOLS_BIN executables directory
- NDA-TOOLS_DATA sample data directory

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=lscratch:20
[user@cn4192 ~]$ module load nda-tools 
[+] Loading singularity  4.0.3  on cn4192
[+] Loading nda-tools  0.3.0

The usage of the command to donwload data from NDA is as follows:

[user@cn3123 user]$ downloadcmd -h
Running NDATools Version 0.3.0
usage: downloadcmd [-h] -dp <package-id> [-t <s3-links-file>]
                   [-ds <structure short-name>] [-u <username>]
                   [-d <download_directory>] [-wt <thread-count>]
                   [--file-regex <regular expression>] [--verify]
                   [-s3 <s3 bucket>] [--verbose] [--log-dir LOG_DIR]
                   [<S3_path_list> [<S3_path_list> ...]]

This application allows you to download files from an NDA package. Tutorials for creating packages can be found on the website (links provided below). Information for packages, including package-ids, are displayed on the packages dashboard page (https://nda.nih.gov/user/dashboard/packages.html). Users can only download data from "personal" type packages. To download files from a "shared" package you need to convert it to a "personal" package first, which can be done by clicking the "Add to my data packages" button in the actions dropdown.
Links:
        video tutorial - https://nda.nih.gov/tutorials/nda/accessing_files_in_the_cloud.html?chapter=creating-a-package
        pdf - https://ndar.nih.gov/ndarpublicweb/Documents/Accessing+Shared+Data+Sept_2021-1.pdf

positional arguments:
  <S3_path_list>        Opional. When provided, the program will download only the specified files from the package. The specified files must exist in the package indicated by the -dp argument and the paths must be valid s3 urls.

required arguments:
  -dp <package-id>, --package <package-id>
                        The package-id containing the files you wish to download. If no other command-line options are provided, the program will download all files from the specified package.

optional arguments:
  -h, --help            show this help message and exit
  -t <s3-links-file>, --txt <s3-links-file>
                        Flags that a text file has been entered from where to download S3 files. For more details, check the information on the README page.
  -ds <structure short-name>, --datastructure <structure short-name>
                        Downloads all the files in a package from the specified data-structure.
                        For example, to download all the image03 files from your package 12345, you should enter:
                            downloadcmd -dp 12345 -ds image03
                        Note - the program only recognizes the short-names of the data-structures. The short-name is listed on the data-structures page
                        and always ends in a 2 digit number. (For example, see the data-structure page for image03 at https://nda.nih.gov/data_structure.html?short_name=image03)
  -u <username>, --username <username>
                        NDA username
  -d <download_directory>, --directory <download_directory>
                        Enter an alternate full directory path where you would like your files to be saved. The default is ~/NDA/nda-tools/<package-id>
  -wt <thread-count>, --workerThreads <thread-count>
                        Specifies the number of downloads to attempt in parallel. For example, running 'downloadcmd -dp 12345 -wt 10' will
                        cause the program to download a maximum of 10 files simultaneously until all of the files from package 12345 have been downloaded.
                        A default value is calculated based on the number of cpus found on the machine, however a higher value can be chosen to decrease download times.
                        If this value is set too high the download will slow. With 32 GB of RAM, a value of '10' is probably close to the maximum number of
                        parallel downloads that the computer can handle
  --file-regex <regular expression>
                        Option can be used to download only a subset of the files in a package.  This command line arg can be used with
                        the -ds, -dp or -t flags.

                        Examples -
                        1) To download all files with a ".txt" extension, you can use the regular expression .*.txt
                            downloadcmd -dp 12345 --file-regex .*.txt
                        2) To download all files that contain "NDARINVZLHFUAF0" in the name, you can use the regular expression NDARINVZLHFUAF0
                            downloadcmd -dp 12345 -ds image03 --file-regex NDARINVZLHFUAF0
                        3) Finally to download all files underneath a folder called "T1w" you can use the regular expression .*/T1w/.*
                            downloadcmd -dp 12345 -t s3-links.txt --file-regex .*/T1w/.*
  --verify              When this option is provided a download is not initiated. Instead, a csv file is produced that contains a record of
                        the files in the download, along with information about the file-size if the file could be found on the computer. For large packages containing millions of files,
                        this verification step can take hours (this can be even longer if files are stored on a network drive). When the program finishes, a few new files/folders
                        will be created (if they don't already exist):
                        1) verification_report folder in the NDA/nda-tools/downloadcmd/packages/<package-id> directory
                        2) .download_progress folder (hidden) in the NDA/nda-tools/downloadcmd/packages/<package-id> directory, which is used to values between command invocations.
                            a. .download_progress/download-job-manifest.csv file - contains entries mapping
                            b. UUID folders inside .download_progress (with names like '6a056ac4-2dd9-48f2-b921-44b29c883578')
                        3) download-verification-report.csv in the NDA/nda-tools/downloadcmd/packages/<package-id> directory
                        4) download-verification-retry-s3-links.csv in the NDA/nda-tools/downloadcmd/packages/<package-id> directory

                        The hidden folder listed in 2 contains special files used by the program to avoid re-running expensive, time-consuming processes. This folder should not be deleted.

                        The download-verification-report.csv file will contain a record for each file in the download and contain 6 columns :
                        1) 'package_file_id'
                        2) 'package_file_expected_location' - base path is the value provided for the -d/--directory arg
                        3) 'nda_s3_url'
                        4) 'exists' - value for column will be ('Y'/'N')
                        5) 'expected_size'
                        6) 'actual_file_size' - value for columnw will be '0' if file doesn't exist

                        In addition, the file will contain 1 header line which will provide the parameters used for the download (more information below).

                        If this file is opened in Excel or Google Docs, the user can easily find information on specific files that they are interested in.

                        This file can be useful but may contain more information than is needed. The download-verification-retry-s3-links.csv file contains the s3 links for all of the files
                        in the download-verification-report.csv where EXISTS = 'N' or EXPECTED-SIZE does not equal ACTUAL-FILE-SIZE. If the user is only interested in re-initiating the download
                        for the files that failed they can do so by using the  download-verification-retry-s3-links.csv as the value for the -t argument. i.e.

                        downloadcmd -dp <package-id> -t NDA/nda-tools/downloadcmd/packages/<package-id>/download-verification-retry-s3-links.csv

                        When the --verify option is provided, the rest of the arguments provided to the command are used to determine what files are supposed to be included in the download.

                        For example, if the user runs:
                           downloadcmd -dp 12345 --verify
                        The download-verification-report.csv file will contain a record for each file in the package 12345. Since no -d/--directory argument is provided, the program
                        will check for the existance of the files in the default download location.

                        If the user runs:
                           downloadcmd -dp 12345 -d /home/myuser/customdirectory --verify
                        The download-verification-report.csv file will contain a record for each file in the package 12345 and will check for the existance of files in the /foo/bar

                        If the user runs:
                           downloadcmd -dp 12345 -d /home/myuser/customdirectory -t file-with-s3-links.csv --verify
                        The download-verification-report.csv file will contain a record for each file listed in the file-with-s3-links.csv and will check for the existance of files in /foo/bar

                        If the user runs:
                           downloadcmd -dp 12345 -d /home/myuser/customdirectory -ds image03 --verify
                        The download-verification-report.csv file will contain a record for each file in the package's image03 data-structure and will check for the existance of files in /foo/bar

                        If the user runs:
                           downloadcmd -dp 12345 -d /home/myuser/customdirectory -ds image03 --file-regex --verify
                        The download-verification-report.csv file will contain a record for each file in the package's image03 data-structure which also matches the file-regex and will check
                        for the existance of files in /foo/bar

                        NOTE - at the moment, this option cannot be used to verify downloads to s3 locations (see -s3 option below). That will be implemented in the near
                        future.
  -s3 <s3 bucket>, --s3-destination <s3 bucket>
                        Specify s3 location which you would like to download your files to. When this option is specified, an attempt will be made
                        to copy the files from your package, which are stored in NDA's own S3 repository, to the S3 bucket provided.

                        For s3-to-s3 copy operations to be successful, the s3 bucket supplied as the program argument must be configured to allow PUT object
                        operations  for 'arn:aws:sts::618523879050:federated-user/<username>' where <username> is your nda username.
                        For non-public buckets, this will require an update to your bucket's policy. The following statement should be sufficient to grant the uploading privileges necessary
                        to run this program using the s3 argument after replacing <your-s3-bucket> with the bucket name:
                                 {
                                    "Sid": "AllowNDAUpload",
                                    "Effect": "Allow",
                                    "Principal": {
                                        "AWS": "arn:aws:iam::618523879050:federated-user/<username>"
                                    },
                                    "Action": "s3:PutObject*",
                                    "Resource": "arn:aws:s3:::<your-s3-bucket>/*"
                                }

                        You may need to email your company/institution IT department to have this added for you.
                        Note: If your bucket is encrypted with a customer-managed KMS key, then additional configuration is needed.
                        For more details, check the information on the README page.
  --verbose             Enables debug logging.
  --log-dir LOG_DIR     Customize the file directory of logs. If this value is not provided or the provided directory does not exist, logs will be saved to NDA/nda-tools/downloadcmd/logs inside your root folder.

For example:

[user@cn4192 ~]$ downloadcmd -dp 1185256 -u <your NDA username> -d  HCPDevImgManifestBeh -wt 8

[user@cn4192 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$