In order to submit data to the National Institute of Mental Health Data Archives (NDA), users must validate their data to ensure it complies with the required format. This is done using the NDA validation tool, vtcmd. Additionally, users can package and download data from NDA as well, using the downloadcmd tool.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --gres=lscratch:20 [user@cn4192 ~]$ module load nda-tools [+] Loading singularity 3.8.5-1 on cn4192 [+] Loading nda-tools 0.2.12The usage of the command to donwload data from NDA is as follows:
[user@cn3123 user]$ downloadcmd -h Running NDATools Version 0.2.13 usage: downloadcmd <S3_path_list> This application allows you to enter a list of aws S3 paths and will download the files to your drive in your home folder. Alternatively, you may enter a packageID, an NDA data structure file or a text file with s3 links, and the client will download all files from the S3 links listed. Please note, the maximum transfer limit of data is 5TB at one time. positional arguments: <S3_path_list> Will download all S3 files to your local drive optional arguments: -h, --help show this help message and exit -dp <package-id>, --package <package-id> Flags to download all S3 files in package. Required. -t <s3-links-file>, --txt <s3-links-file> Flags that a text file has been entered from where to download S3 files. -ds <structure short-name>, --datastructure <structure short-name> Downloads all the files in a package from the specified data-structure. For example, to download all the image03 files from your package 12345, you should enter: downloadcmd -dp 12345 -ds image03 Note - the program only recognizes the short-names of the data-structures. The short-name is listed on the data-structures page and always ends in a 2 digit number. (For example, see the data-structure page for image03 at https://nda.nih.gov/data_structure.html?short_name=image03) -u <username>, --username <username> NDA username -p <password>, --password <password> NDA password -r, --resume Flags to restart a download process. If you already have some files downloaded, you must enter the directory where they are saved. -d <download_directory>, --directory <download_directory> Enter an alternate full directory path where you would like your files to be saved. The default is ~/NDA/nda-tools/<package-id> -wt <thread-count>, --workerThreads <thread-count> Specifies the number of downloads to attempt in parallel. For example, running 'downloadcmd -dp 12345 -wt 10' will cause the program to download a maximum of 10 files simultaneously until all of the files from package 12345 have been downloaded. A default value is calculated based on the number of cpus found on the machine, however a higher value can be chosen to decrease download times. If this value is set too high the download will slow. With 32 GB of RAM, a value of '10' is probably close to the maximum number of parallel downloads that the computer can handle -q, --quiet Option to suppress output of detailed messages as the program runs. --file-regex <regular expression> Option can be used to download only a subset of the files in a package. This command line arg can be used with the -ds, -dp or -t flags. Examples - 1) To download all files with a ".txt" extension, you can use the regular expression .*.txt downloadcmd -dp 12345 --file-regex .*.txt 2) To download all files that contain "NDARINVZLHFUAF0" in the name, you can use the regular expression NDARINVZLHFUAF0 downloadcmd -dp 12345 -ds image03 --file-regex NDARINVZLHFUAF0 3) Finally to download all files underneath a folder called "T1w" you can use the regular expression .*/T1w/.* downloadcmd -dp 12345 -t s3-links.txt --file-regex .*/T1w/.* --verify When this option is provided a download is not initiated. Instead, a csv file is produced that contains a record of the files in the download, along with information about the file-size if the file could be found on the computer. For large packages containing millions of files, this verification step can take hours (this can be even longer if files are stored on a network drive). When the program finishes, a few new files/folders will be created (if they don't already exist): 1) verification_report folder in the NDA/nda-tools/downloadcmd/packages/<package-id> directory 2) .download_progress folder (hidden) in the NDA/nda-tools/downloadcmd/packages/<package-id> directory, which is used to values between command invocations. a. .download_progress/download-job-manifest.csv file - contains entries mapping b. UUID folders inside .download_progress (with names like '6a056ac4-2dd9-48f2-b921-44b29c883578') 3) download-verification-report.csv in the NDA/nda-tools/downloadcmd/packages/<package-id> directory 4) download-verification-retry-s3-links.csv in the NDA/nda-tools/downloadcmd/packages/<package-id> directory The hidden folder listed in 2 contains special files used by the program to avoid re-running expensive, time-consuming processes. This folder should not be deleted. The download-verification-report.csv file will contain a record for each file in the download and contain 6 columns : 1) 'package_file_id' 2) 'package_file_expected_location' - base path is the value provided for the -d/--directory arg 3) 'nda_s3_url' 4) 'exists' - value for column will be ('Y'/'N') 5) 'expected_size' 6) 'actual_file_size' - value for columnw will be '0' if file doesn't exist In addition, the file will contain 1 header line which will provide the parameters used for the download (more information below). If this file is opened in Excel or Google Docs, the user can easily find information on specific files that they are interested in. This file can be useful but may contain more information than is needed. The download-verification-retry-s3-links.csv file contains the s3 links for all of the files in the download-verification-report.csv where EXISTS = 'N' or EXPECTED-SIZE does not equal ACTUAL-FILE-SIZE. If the user is only interested in re-initiating the download for the files that failed they can do so by using the download-verification-retry-s3-links.csv as the value for the -t argument. i.e. downloadcmd -dp <package-id> -t NDA/nda-tools/downloadcmd/packages/<package-id>/download-verification-retry-s3-links.csv When the --verify option is provided, the rest of the arguments provided to the command are used to determine what files are supposed to be included in the download. For example, if the user runs: downloadcmd -dp 12345 --verify The download-verification-report.csv file will contain a record for each file in the package 12345. Since no -d/--directory argument is provided, the program will check for the existance of the files in the default download location. If the user runs: downloadcmd -dp 12345 -d /home/myuser/customdirectory --verify The download-verification-report.csv file will contain a record for each file in the package 12345 and will check for the existance of files in the /foo/bar If the user runs: downloadcmd -dp 12345 -d /home/myuser/customdirectory -t file-with-s3-links.csv --verify The download-verification-report.csv file will contain a record for each file listed in the file-with-s3-links.csv and will check for the existance of files in /foo/bar If the user runs: downloadcmd -dp 12345 -d /home/myuser/customdirectory -ds image03 --verify The download-verification-report.csv file will contain a record for each file in the package's image03 data-structure and will check for the existance of files in /foo/bar If the user runs: downloadcmd -dp 12345 -d /home/myuser/customdirectory -ds image03 --file-regex --verify The download-verification-report.csv file will contain a record for each file in the package's image03 data-structure which also matches the file-regex and will check for the existance of files in /foo/bar NOTE - at the moment, this option cannot be used to verify downlods to s3 locations (see -s3 option below). That will be implemented in the near future. -s3 <s3 bucket>, --s3-destination <s3 bucket> Specify s3 location which you would like to download your files to. When this option is specified, an attempt will be made to copy the files from your package, which are stored in NDA's own S3 repository, to the S3 bucket provided. This is the preferred way to download data from NDA for two reasons: 1) FASTER !!! - Downloads to another s3 bucket are orders of magnitude faster because data doesn't leave AWS 2) CHEAPER (for us) - We do not limit the amount of data transferred to another bucket, but we do when its downloaded out of AWS. For s3-to-s3 copy operations to be successfull, the s3 bucket supplied as the program arugment must be configured to allow PUT object operations for 'arn:aws:sts::618523879050:federated-user/<username>' where <username> is your nda username. For non-public buckets, this will require an update to your bucket's policy. The following statement should be sufficient to grant the uploading privileges necessary to run this program using the s3 argument (after replacing <your-s3-bucket> with the appropriate value): { "Sid": "AllowNDAUpload", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::618523879050:federated-user/<username>" }, "Action": "s3:PutObject*", "Resource": "arn:aws:s3:::<your-s3-bucket>/*" } You may need to email your company/institution IT department to have this added for you.For example:
[user@cn4192 ~]$ downloadcmd -dp 1185256 -u <your NDA username> -p <your NDA password> -d HCPDevImgManifestBeh -wt 8 -v
[user@cn4192 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$