The European Genome-phenome Archive (EGA) is designed to be a repository for all types of sequence and genotype experiments, including case-control, population, and family studies.
pyega3 is python-based and data is downloaded over secure https connections, instead of http. This allows EGA to send the data as unencrypted data (via encrypted connections); so, you don’t have to decrypt files after download. Files are verified against the unencrypted MD5 after download (you can also get the unencrypted MD5 via REST call from the API directly). pyega3 supports segmenting (breaking a file into up to 4 segments and downloading them in parallel) and resume; so even if a file is downloaded as one continuous stream, the download will simply resume if there was an error or an interrupted connection.Users will need the appropriate credentials (provided by EGA, contact ega-helpdesk@ebi.ac.uk) to download any data from EGA.
Sample session using the latest pyega3 to download the test EGA dataset (-t flag)
[user@helix ~]$ mkdir -pv /data/${USER}/egatest mkdir: created directory ‘/data/user/egatest’ [user@helix ~]$ cd !$ cd /data/${USER}/egatest [user@helix egatest]$ module load pyega3 [+] Loading singularity 3.7.3 on helix.nih.gov [+] Loading pyega3 3.3.0 ... [user@helix egatest]$ pyega3 -t datasets [2021-04-23 16:33:47 -0400] [2021-04-23 16:33:47 -0400] pyEGA3 - EGA python client version 3.3.0 (https://github.com/EGA-archive/ega-download-client) [2021-04-23 16:33:47 -0400] Parts of this software are derived from pyEGA (https://github.com/blachlylab/pyega) by James Blachly [2021-04-23 16:33:47 -0400] Python version : 3.6.9 [2021-04-23 16:33:47 -0400] OS version : Linux #1 SMP Mon Feb 22 18:03:13 EST 2021 [2021-04-23 16:33:47 -0400] Server URL: https://ega.ebi.ac.uk:8052/elixir/data [2021-04-23 16:33:47 -0400] Session-Id: 3090364833 [2021-04-23 16:33:51 -0400] [2021-04-23 16:33:51 -0400] Authentication success for user 'ega-test-data@ebi.ac.uk' [2021-04-23 16:33:55 -0400] Dataset ID [2021-04-23 16:33:55 -0400] ----------------- [2021-04-23 16:33:55 -0400] EGAD00001003338 [2021-04-23 16:33:55 -0400] EGAD00001006673 [user@helix egatest]$ pyega3 -t files EGAD00001003338 [2021-04-23 16:34:27 -0400] [2021-04-23 16:34:27 -0400] pyEGA3 - EGA python client version 3.3.0 (https://github.com/EGA-archive/ega-download-client) [2021-04-23 16:34:27 -0400] Parts of this software are derived from pyEGA (https://github.com/blachlylab/pyega) by James Blachly [2021-04-23 16:34:27 -0400] Python version : 3.6.9 [2021-04-23 16:34:27 -0400] OS version : Linux #1 SMP Mon Feb 22 18:03:13 EST 2021 [2021-04-23 16:34:27 -0400] Server URL: https://ega.ebi.ac.uk:8052/elixir/data [2021-04-23 16:34:27 -0400] Session-Id: 2040888476 [2021-04-23 16:34:31 -0400] [2021-04-23 16:34:31 -0400] Authentication success for user 'ega-test-data@ebi.ac.uk' [2021-04-23 16:34:38 -0400] File ID Status Bytes Check sum File name [2021-04-23 16:34:38 -0400] EGAF00005007332 1 229305 56e8de04466aba23ab5acbaf1c087045 NA18534.GRCh38DH.exome.cram.crai [2021-04-23 16:34:38 -0400] EGAF00005007331 1 137465 fcf1cc38cd404ea1cdba3975d26f4a8b HG01775.GRCh38DH.exome.cram.crai [2021-04-23 16:34:38 -0400] EGAF00005007330 1 4722 110b493c17210ff3484ed2561a2fe21f HG01775.chrY.bcf.csi [2021-04-23 16:34:38 -0400] EGAF00005007329 1 876313 aaca702e347ae6caa734d44527a49212 HG01775.chrY.bcf [2021-04-23 16:34:38 -0400] EGAF00005007328 1 4981 d0e71e5dd7f5279e113c4f0dfd37fc23 HG01775.chrY.vcf.gz.tbi [2021-04-23 16:34:38 -0400] EGAF00005007327 1 850737 f3dee64b466efe334b2cac77f5c2f710 HG01775.chrY.vcf.gz [2021-04-23 16:34:38 -0400] EGAF00005007326 1 6251 ae2d2097a8744877d9d20907200cbdcf ALL.chrY.phase3_integrated_v2a.20130502.genotypes.bcf.csi [2021-04-23 16:34:38 -0400] EGAF00005007325 1 5527171 395c0d3d454d7c7d61c4f771fbab02fc ALL.chrY.phase3_integrated_v2a.20130502.genotypes.bcf [2021-04-23 16:34:38 -0400] EGAF00005007324 1 8074 fa37e14805cce3221f6f9d3a4cd749a4 ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi [2021-04-23 16:34:39 -0400] EGAF00005007323 1 5719142 388fb466c983d4bec2082941647409f3 ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz [2021-04-23 16:34:39 -0400] EGAF00005007181 1 2938941932 910141b9f4ccbfbf57813dee1a7a3f1d NA18534.GRCh38DH.exome.cram [2021-04-23 16:34:39 -0400] EGAF00005007180 1 1837578063 74d3b803823d3f8b73bd592941f23726 HG01775.GRCh38DH.exome.cram [2021-04-23 16:34:39 -0400] EGAF00005001626 1 27620 09e3b4724404fc7bb5f9948f80016757 ALL_chr22_20130502_2504Individuals.bcf.csi [2021-04-23 16:34:39 -0400] EGAF00005001625 1 186424665 c65ca1a4abd55351598ccbc65ebfa9a6 ALL_chr22_20130502_2504Individuals.bcf [2021-04-23 16:34:39 -0400] EGAF00005001624 1 36094 4202e9a481aa8103b471531a96665047 ALL_chr22_20130502_2504Individuals.vcf.gz.tbi [2021-04-23 16:34:39 -0400] EGAF00005001623 1 214453766 ad7d6e0c05edafd7faed7601f7f3eaba ALL_chr22_20130502_2504Individuals.vcf.gz [2021-04-23 16:34:39 -0400] EGAF00005000665 1 14509 7cf0f467fd44dd783ff05cb4662642b6 NA19238.chr22.bcf.csi [2021-04-23 16:34:39 -0400] EGAF00005000664 1 26957200 62b16cc9ce6ceb3ef97b98c99aa6fec5 NA19238.chr22.bcf [2021-04-23 16:34:39 -0400] EGAF00005000663 1 18596 02fdb6fc68b854f98fef710ff4dee0c1 NA19238.chr22.vcf.gz.tbi [2021-04-23 16:34:39 -0400] EGAF00005000662 1 25444204 274de4071bca5354ff16a1de0116c455 NA19238.chr22.vcf.gz [2021-04-23 16:34:39 -0400] EGAF00001775036 1 4804928 3b89b96387db5199fef6ba613f70e27c ENCFF284YOU.bam.bai [2021-04-23 16:34:39 -0400] EGAF00001775034 1 5991400 b8ae14d5d1f717ab17d45e8fc36946a0 ENCFF000VWO.bam.bai [2021-04-23 16:34:39 -0400] EGAF00001770107 1 3551031027 dfef3f355230915418a78da460665d56 ENCFF284YOU.bam [2021-04-23 16:34:39 -0400] EGAF00001770106 1 462139278 ce073afcbc07afa343f2d4e4d07efeda ENCFF000VWO.bam [2021-04-23 16:34:39 -0400] EGAF00001753757 1 9018288 351130149989cca43fe8c7382e9d326a NA19240.bai [2021-04-23 16:34:39 -0400] EGAF00001753756 1 140445765831 2413ce93a4b2b50fa0c2ff5bdf97695f NA19240.bam [2021-04-23 16:34:39 -0400] EGAF00001753755 1 9005792 767fc92be753de8cf570690bd7fbe629 NA19239.bai [2021-04-23 16:34:39 -0400] EGAF00001753754 1 136016115737 59fbc3828fb878d8e637557ce707d445 NA19239.bam [2021-04-23 16:34:39 -0400] EGAF00001753753 1 9379032 028ab5c73fea03c349e0d73943913141 NA19238.bai [2021-04-23 16:34:39 -0400] EGAF00001753752 1 229774247950 0751106bbe1c4c83ec934a5972a4efdf NA19238.bam [2021-04-23 16:34:39 -0400] EGAF00001753751 1 9204720 c1eadd98469fcd3ced4c51a84b3ce307 NA12892.bai [2021-04-23 16:34:39 -0400] EGAF00001753750 1 66145394874 201bded705401615fe5e90988d509656 NA12892.bam [2021-04-23 16:34:39 -0400] EGAF00001753749 1 9212704 e04dbb7ccbc24ccd853d89b8b066166c NA12891.bai [2021-04-23 16:34:39 -0400] EGAF00001753748 1 4317237247 71a78dfb5258939abab2257a2abd1126 NA12891.bam [2021-04-23 16:34:39 -0400] EGAF00001753747 1 8949984 a23a84c89d338796f78e68804c8d2c6c NA12878.bam.bai [2021-04-23 16:34:39 -0400] EGAF00001753746 1 143427187111 11395de33f28ed867170d2dc723cc700 NA12878.bam [2021-04-23 16:34:39 -0400] EGAF00001753745 1 1622405 18e0e7070b6cf4d042c7f9bee15d56bd NA19240.crai [2021-04-23 16:34:39 -0400] EGAF00001753744 1 48309446909 728bea9317cbab1c98429e43e48f9a83 NA19240.cram [2021-04-23 16:34:39 -0400] EGAF00001753743 1 1514700 be2024ccbf5b3bd9132f6d270a37118c NA19239.crai [2021-04-23 16:34:39 -0400] EGAF00001753742 1 44113571936 d963539652de2ea20005d98e934d59c2 NA19239.cram [2021-04-23 16:34:39 -0400] EGAF00001753741 1 1195785 3b862e018b0b85db7954cbed2e17b6ba NA19238.crai [2021-04-23 16:34:39 -0400] EGAF00001753740 1 34823972801 492780f603da2f5f3306c41011e0acd2 NA19238.cram [2021-04-23 16:34:39 -0400] EGAF00001753739 1 1331384 bb569235226b5b9f0578d34d1b52482e NA12892.crai [2021-04-23 16:34:39 -0400] EGAF00001753738 1 38370156211 a7503d228d0851b999b826b736b8dd32 NA12892.cram [2021-04-23 16:34:39 -0400] EGAF00001753737 1 1310034 0ab7a2d110740561871ccdca7f15f13b NA12891.crai [2021-04-23 16:34:39 -0400] EGAF00001753736 1 38215425935 bbc03793c9534a22f77e751d2723cb10 NA12891.cram [2021-04-23 16:34:39 -0400] EGAF00001753735 1 1575103 41fd8741e91924eae19c6baa7893eeb8 NA12878.crai [2021-04-23 16:34:39 -0400] EGAF00001753734 1 45030910198 040ef7533533a3db67a35b9f454b9269 NA12878.cram [2021-04-23 16:34:39 -0400] -------------------------------------------------------------------------------- [2021-04-23 16:34:39 -0400] Total dataset size = 911.13 GB [user@helix egatest]$ pyega3 -t fetch EGAF00001753735 [2021-04-23 16:35:45 -0400] [2021-04-23 16:35:45 -0400] pyEGA3 - EGA python client version 3.3.0 (https://github.com/EGA-archive/ega-download-client) [2021-04-23 16:35:45 -0400] Parts of this software are derived from pyEGA (https://github.com/blachlylab/pyega) by James Blachly [2021-04-23 16:35:45 -0400] Python version : 3.6.9 [2021-04-23 16:35:45 -0400] OS version : Linux #1 SMP Mon Feb 22 18:03:13 EST 2021 [2021-04-23 16:35:45 -0400] Server URL: https://ega.ebi.ac.uk:8052/elixir/data [2021-04-23 16:35:45 -0400] Session-Id: 1656850299 [2021-04-23 16:35:58 -0400] [2021-04-23 16:35:58 -0400] Authentication success for user 'ega-test-data@ebi.ac.uk' [2021-04-23 16:36:05 -0400] [2021-04-23 16:36:05 -0400] Authentication success for user 'ega-test-data@ebi.ac.uk' [2021-04-23 16:36:05 -0400] File Id: 'EGAF00001753735'(1575103 bytes). [2021-04-23 16:36:05 -0400] Total space : 5839680.00 GiB [2021-04-23 16:36:05 -0400] Used space : 3735395.91 GiB [2021-04-23 16:36:05 -0400] Free space : 2104284.09 GiB [2021-04-23 16:36:05 -0400] Download starting [using 1 connection(s)]... 100%|###########################################################################| 1.58M/1.58M [00:08<00:00, 195kB/s] [2021-04-23 16:36:13 -0400] Combining file chunks (this operation can take a long time depending on the file size) 100%|##########################################################################| 1.58M/1.58M [00:00<00:00, 3.06GB/s] [2021-04-23 16:36:13 -0400] Calculating md5 (this operation can take a long time depending on the file size) 100%|###########################################################################| 1.58M/1.58M [00:00<00:00, 412MB/s] [2021-04-23 16:36:13 -0400] Verifying file checksum [2021-04-23 16:36:13 -0400] Saved to : '/gpfs/gsfs11/users/user/egatest/EGAF00001753735/NA12878.crai'(1575087 bytes, md5=41fd8741e91924eae19c6baa7893eeb8) [2021-04-23 16:36:13 -0400] Download complete [user@helix egatest]$ ls -lh total 129K drwxr-x--- 2 user user 4.0K Apr 23 16:36 EGAF00001753735 -rw-r----- 1 user user 9.1K Apr 23 16:36 pyega3_output.log [user@helix egatest]$
Attempting to make too many simultaneous connections to the EGA download servers is likely to cause slowdowns, instability, and overall slower downloads. Please limit the maximum number of simultaneous downloads as described in this section.
The recommended method to manage downloads of many files from EGA concurrently is by using Swarm while limiting the number of simultaenous subjobs:
("ega_download.swarm")cd /data/$USER/ega; pyega3 -t fetch EGAF00001753734 cd /data/$USER/ega; pyega3 -t fetch EGAF00001753735 cd /data/$USER/ega; pyega3 -t fetch EGAF00001753736 cd /data/$USER/ega; pyega3 -t fetch EGAF00001753737 cd /data/$USER/ega; pyega3 -t fetch EGAF00001753738 cd /data/$USER/ega; pyega3 -t fetch EGAF00001753739 cd /data/$USER/ega; pyega3 -t fetch EGAF00001753740 cd /data/$USER/ega; pyega3 -t fetch EGAF00001753741 cd /data/$USER/ega; pyega3 -t fetch EGAF00001753742 cd /data/$USER/ega; pyega3 -t fetch EGAF00001753743You can then submit this with swarm:
swarm --maxrunning 8 ega_download.swarm
This script is no longer the recommended approach now that PyEGA3 may be used on compute nodes.
WARNING!
The following script will launch background processes that will continue to download data even after you log off of Helix. The EGA archive contains large data sets that can easily fill your quotas. If you decide to use this method or something similar, closely monitor your data and be ready kill processes if they are filling your directories. This example alone downloads a terabyte of data.
First, generate or obtain a file with the names of the files you want to download. In this example we use the pyega3 command to get informatin about the test data set from the archive, and then we use the grep command with a sensible regular expression to generate a file list. You may need to use a different method to obtain or create a file list for you data set.
[user@helix egatest]$ module load pyega3 [+] Loading singularity 3.7.3 on helix.nih.gov [+] Loading pyega3 3.3.0 ... [user@helix egatest]$ pyega3 -t files EGAD00001003338 >raw_file_info 2>&1 [user@helix egatest]$ grep -oP "(EGAF\d{11})" raw_file_info >file_list [user@helix egatest]$ head file_list EGAF00005007332 EGAF00005007331 EGAF00005007330 EGAF00005007329 EGAF00005007328 EGAF00005007327 EGAF00005007326 EGAF00005007325 EGAF00005007324 EGAF00005007323 [user@helix egatest]$
Once you have a file list, you can use the script below (or a modified variant) to start 6 simultaneous loops downloading the files. For our purposes, assume this script is named pyega3loop.sh. You can download this script (via most browsers) here.
#!/bin/bash
usage () {
echo "nohup ${0} <file list> <credentials file>"
echo ""
echo "egaloop will read entries from a <file list>
and start 6 processes to try to download those files to the current
directory using the credentials saved in <credentials file>. It's a
good idea to run this script with the nohup directive"
exit 1
}
# if a credentials file was supplied make the arg string and check that the
# file exists
if [ $# -eq 2 ]; then
CREDOPT="-cf ${2}"
if [ ! -f "${2}" ]; then
usage
fi
# otherwise, assume the -t (test account) option
elif [ $# -eq 1 ]; then
CREDOPT="-t"
# if the user did not supply 1 or 2 args then display usage and exit
else
usage
fi
# also check that the file list exits
if [ ! -f "${1}" ]; then
usage
fi
egaloop () {
# loop through given indices of the file list
STR=$1
END=$2
while [ $STR -le $END ]; do
FILE=$(sed "${STR}q;d" $FLIST)
# echo "pyega3 ${CREDOPT} fetch ${FILE}"
pyega3 ${CREDOPT} fetch ${FILE}
STR=$(($STR+1))
done
}
module load pyega3/3.3.0
FLIST=$1
PROC_N=6
LINE_N=$(wc -l ${FLIST} | awk '{split($0,a," "); print a[1]}')
CHUNK=$(($LINE_N/$PROC_N))
ii=1
THIS_LINE=0
while true; do
# if it's the last iteration make sure to get the rest
# in case of rounding errors
if [ $ii -eq $PROC_N ]; then
# echo "egaloop $(($THIS_LINE+1)) $LINE_N &"
egaloop $(($THIS_LINE+1)) $LINE_N &
break
fi
# echo "egaloop $(($THIS_LINE+1)) $(($THIS_LINE+$CHUNK)) &"
egaloop $(($THIS_LINE+1)) $(($THIS_LINE+$CHUNK)) &
THIS_LINE=$(($THIS_LINE+$CHUNK))
ii=$(($ii+1))
done
You can call this script with the file list as the first argument and credentials file as an optional second argument. The script splits up your file list into 6 different chunks and starts 6 simultaneous processes working to download each of the 6 sub-lists of files. If you begin this script with the nohup directive each process will work silently in the background and save output to the nohup.log file.
[user@helix egatest]$ nohup ./pyega3loop.sh file_list nohup: ignoring input and appending output to ‘nohup.out’ [user@helix egatest]$ ps --forest -u $USER PID TTY TIME CMD 38178 ? 00:00:00 sshd 38203 pts/145 00:00:00 \_ bash 42096 pts/145 00:00:00 \_ ps 41072 pts/145 00:00:00 pyega3loop.sh 41084 pts/145 00:00:00 \_ pyega3 41119 pts/145 00:00:00 \_ starter-suid 41270 pts/145 00:00:00 \_ pyega3 41071 pts/145 00:00:00 pyega3loop.sh 41082 pts/145 00:00:00 \_ pyega3 41120 pts/145 00:00:00 \_ starter-suid 41272 pts/145 00:00:00 \_ pyega3 41068 pts/145 00:00:00 pyega3loop.sh 41086 pts/145 00:00:00 \_ pyega3 41122 pts/145 00:00:00 \_ starter-suid 41271 pts/145 00:00:00 \_ pyega3 41067 pts/145 00:00:00 pyega3loop.sh 41083 pts/145 00:00:00 \_ pyega3 41121 pts/145 00:00:00 \_ starter-suid 41267 pts/145 00:00:00 \_ pyega3 41066 pts/145 00:00:00 pyega3loop.sh 41087 pts/145 00:00:00 \_ pyega3 41129 pts/145 00:00:00 \_ starter-suid 41268 pts/145 00:00:00 \_ pyega3 41065 pts/145 00:00:00 pyega3loop.sh 41085 pts/145 00:00:00 \_ pyega3 41127 pts/145 00:00:00 \_ starter-suid 41269 pts/145 00:00:00 \_ pyega3 [user@helix egatest]$ ls -lh total 8.2M drwxr-x--- 2 user user 4.0K Apr 26 12:46 EGAF00001753740 drwxr-x--- 2 user user 4.0K Apr 26 12:46 EGAF00001753741 drwxr-x--- 2 user user 4.0K Apr 26 12:46 EGAF00001753749 drwxr-x--- 2 user user 4.0K Apr 26 12:46 EGAF00001753757 drwxr-x--- 2 user user 4.0K Apr 26 12:46 EGAF00005000664 drwxr-x--- 2 user user 4.0K Apr 26 12:46 EGAF00005000665 drwxr-x--- 2 user user 4.0K Apr 26 12:46 EGAF00005007323 drwxr-x--- 2 user user 4.0K Apr 26 12:46 EGAF00005007324 drwxr-x--- 2 user user 4.0K Apr 26 12:46 EGAF00005007331 drwxr-x--- 2 user user 4.0K Apr 26 12:46 EGAF00005007332 -rw-r----- 1 user user 768 Apr 26 12:29 file_list -rw------- 1 user user 28K Apr 26 12:46 nohup.out -rwxr-x--- 1 user user 1.7K Apr 26 12:24 pyega3loop.sh -rw-r----- 1 user user 15K Apr 26 12:46 pyega3_output.log -rw-r----- 1 user user 6.7K Apr 26 12:27 raw_file_info [user@helix egatest]$
You can monitor the progress of these downloads using the tail -f command to output the contents of nohup.out or pyega3_output.log. If you need to stop these processes you can do so with the killall command like so:
[user@helix egatest]$ killall --user $USER pyega3loop.sh && killall --user $USER pyega3 [user@helix egatest]$ ps --forest -u $USER PID TTY TIME CMD 38178 ? 00:00:00 sshd 38203 pts/145 00:00:00 \_ bash 40204 pts/145 00:00:00 \_ ps