Box is one of the collaboration tools provided by NIH. It can be used for collaboration and file sharing with users both inside and outside NIH. Files can be transferred directly between your NIH Box account and HPC systems storage.
OneDrive is the Microsoft cloud storage service included in the NIH Microsoft 365 subscription. Files can be shared within NIH, and all NIH users have access.
| | Box | OneDrive |
|---|---|---|
| URL | NIH Box | NIH OneDrive |
| Documentation | CIT-BOX documentation | CIT-OneDrive documentation |
| Accounts | Need to apply for an account via the online form | All NIH personnel have automatic access |
| Space limit | Unlimited | Initial allocation 1 TB, can request up to 5 TB |
| Individual file size limit | 15 GB (note: rclone can chunk files) | 250 GB (note: rclone can chunk files) |
| Data sharing | Can share outside NIH | Can only share within NIH |
| Access | Can access from anywhere (NIH network or VPN not required) | Must be on NIH network or VPN to access |
| Transfer to/from Helix/Biowulf | via rclone (Globus connector coming...) | via rclone or Globus (note: individual file size limit is 100 GB for Globus) |
| Sample transfer speeds from Helix using rclone | 24 MB/s (1 GB file - 40 seconds); 35 MB/s (10 GB file - 5 mins) | 9.5 MB/s (1 GB file - 1.6 mins); 9.3 MB/s (10 GB file - 17 mins) |
| Access/download trail | Can track whether data has been accessed or downloaded | ? |
The transfer speeds above are samples. You may get higher or lower transfer speeds.
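As a rough planning aid, the sample rates in the table can be turned into a time estimate by dividing file size by rate. The sketch below is plain arithmetic only; the function name `estimate_transfer_seconds` is made up for illustration, and it assumes decimal megabytes and a rate that stays constant, which a real transfer will not do.

```shell
# Hypothetical helper: estimate transfer time in whole seconds (rounded up).
# $1 = file size in MB, $2 = assumed sustained rate in MB/s (integers only)
estimate_transfer_seconds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 1 GB file at the sample Box rate of 24 MB/s
estimate_transfer_seconds 1000 24
# 10 GB file at the sample Box rate of 35 MB/s
estimate_transfer_seconds 10000 35
```

The two calls print 42 and 286 (seconds), which is in line with the 40 seconds and 5 minutes listed in the table.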
rclone has to be configured on Helix/Biowulf before it can connect to NIH Box. The steps needed are outlined below.
During this process a token has to be copied and pasted. Some tokens are too long and result in an error during configuration or token renewal. If that happens, you can carry out the entire configuration on your local system and then copy the resulting config file (or just a section of it) from your local system to ~/.config/rclone/rclone.conf on Helix.
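If only one remote's section needs to be carried over, it can be pulled out of a local config file with a few lines of awk. This is a minimal sketch that assumes the local rclone.conf is not encrypted (an unencrypted config is plain INI text with one `[name]` header per remote); the function name `extract_remote_section` is hypothetical.

```shell
# Hypothetical helper: print one remote's section from an unencrypted
# rclone.conf.
# $1 = path to the config file, $2 = remote name without brackets (e.g. box)
extract_remote_section() {
    awk -v name="$2" '
        # On every section header, decide whether this is the wanted section.
        /^\[/ { in_section = ($0 == "[" name "]") }
        # Print lines only while inside the wanted section.
        in_section { print }
    ' "$1"
}
```

For example, `extract_remote_section rclone.conf box > box_section.conf` would write only the `[box]` section to a new file (both file names are placeholders), which can then be copied to Helix and merged into ~/.config/rclone/rclone.conf.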
Instructions and downloads can be found on the rclone download page.
Connect to helix.nih.gov as usual and begin the rclone configuration step.
The configuration for rclone is stored in ~/.config/rclone/rclone.conf and can optionally be encrypted. This can be done when first setting up the configuration or later in a new rclone config session.
If you forget your encryption password, you will have to delete your rclone.conf file and start over.
The answers to the interactive prompts are the values entered after each ">" prompt in the session below.
(helix)$ module load rclone
(helix)$ rclone config
2019/09/03 09:20:27 NOTICE: Config file "/home/user/.config/rclone/rclone.conf" not found - using defaults
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> s
Your configuration is not encrypted.
If you add a password, you will protect your login information to cloud services.
a) Add Password
q) Quit to main menu
a/q> a
Enter NEW configuration password:
password: ******
Confirm NEW configuration password:
password: ******
Password set
Your configuration is encrypted.
c) Change Password
u) Unencrypt configuration
q) Quit to main menu
c/u/q> q
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> box
Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / 1Fichier
   \ (fichier)
 2 / Akamai NetStorage
   \ (netstorage)
 3 / Alias for an existing remote
   \ (alias)
 4 / Amazon Drive
   \ (amazon cloud drive)
 5 / Amazon S3 Compliant Storage Providers including AWS, Alibaba, Ceph, China Mobile, Cloudflare, ArvanCloud, DigitalOcean, Dreamhost, Huawei OBS, IBM COS, IDrive e2, IONOS Cloud, Liara, Lyve Cloud, Minio, Netease, RackCorp, Scaleway, SeaweedFS, StackPath, Storj, Tencent COS, Qiniu and Wasabi
   \ (s3)
 6 / Backblaze B2
   \ (b2)
 7 / Better checksums for other remotes
   \ (hasher)
 8 / Box
   \ (box)
...
Storage> box
** See help for box backend at: https://rclone.org/box/ **
Option client_id.
OAuth Client Id.
Leave blank normally.
Enter a value. Press Enter to leave empty.
client_id>
Option client_secret.
OAuth Client Secret.
Leave blank normally.
Enter a value. Press Enter to leave empty.
client_secret>
Option box_config_file.
Box App config.json location
Leave blank normally.
Leading `~` will be expanded in the file name as will environment variables such as `${RCLONE_CONFIG_DIR}`.
Enter a value. Press Enter to leave empty.
box_config_file>
Option access_token.
Box App Primary Access Token
Leave blank normally.
Enter a value. Press Enter to leave empty.
access_token>
Option box_sub_type.
Choose a number from below, or type in your own string value.
Press Enter for the default (user).
 1 / Rclone should act on behalf of a user.
   \ (user)
 2 / Rclone should act on behalf of a service account.
   \ (enterprise)
box_sub_type> 1
Edit advanced config?
y) Yes
n) No (default)
y/n> n
Use web browser to automatically authenticate rclone with remote?
 * Say Y if the machine running rclone has a web browser you can use
 * Say N if running rclone on a (remote) machine without web browser access
If not sure try Y. If Y failed, try N.
y) Yes (default)
n) No
y/n> n
Option config_token.
For this to work, you will need rclone available on a machine that has a web browser available.
For more help and alternate methods see: https://rclone.org/remote_setup/
Execute the following on the machine with the web browser (same rclone version recommended):
    rclone authorize "box"
Then paste the result.
Enter a value.
config_token>
On your local computer run rclone authorize box. For example, on a Windows system with PowerShell:
PS C:\Users\user\...\rclone-v1.58.1-windows-amd64> .\rclone.exe authorize box
2022/06/15 15:42:25 NOTICE: Config file "C:\\Users\\user\\AppData\\Roaming\\rclone\\rclone.conf" not found - using defaults
2022/06/15 15:42:25 NOTICE: If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth?state=...
2022/06/15 15:42:25 NOTICE: Log in and authorize rclone for access
2022/06/15 15:42:25 NOTICE: Waiting for code...
Note that on a Mac you may need an administrator to install and use rclone.
At this point rclone will open a browser window:
After choosing single sign-on you will be prompted to enter your email address. Please use your @mail.nih.gov or @nih.gov address.
You may be redirected to an intermediate Microsoft sign-in page. Either enter your email address again or select it from the available options if it is already listed.
Then you should see an NIH login page, where you log in as NIH\username (substituting your own username).
Once you log in, Box will ask whether it should grant access to rclone. Please note that the resulting token will expire and needs to be renewed.
You should see a success message in the browser, and the rclone session in the PowerShell window (or macOS terminal) will resume and show something like this:
2022/06/15 15:43:17 NOTICE: Got code
Paste the following into your remote machine --->
{"access_token":"...","token_type":"bearer","refresh_token":"...","expiry":"..."}
<---End paste
Copy the code and paste it into the rclone config prompt on helix:
config_token> {"access_token":"...","token_type":"bearer","refresh_token":"...","expiry":"..."}
--------------------
Configuration complete.
Options:
type = box
box_sub_type = user
token = {"access_token":"xxxx","token_type":"bearer","refresh_token":"xxxx","expiry":"..."}
Keep this "box" remote?
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:
Name                 Type
====                 ====
box                  box
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
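Once the configuration session has finished, a script can confirm that the remote exists before starting a transfer. The sketch below relies on `rclone listremotes`, which prints each configured remote on its own line followed by a colon; the function name `remote_is_configured` is hypothetical, and with an encrypted config rclone will still ask for the configuration password unless RCLONE_CONFIG_PASS is set.

```shell
# Hypothetical helper: succeed if a remote with the given name is configured.
# $1 = remote name without the trailing colon (e.g. box)
remote_is_configured() {
    # -x matches the whole line, -F treats the name as a fixed string
    rclone listremotes | grep -qxF "$1:"
}
```

For example, `remote_is_configured box && echo "box remote found"` prints the message only if the remote set up above is present.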
A simple example of rclone usage is shown below. Please see the official rclone documentation for more details. Since the configuration in this example is encrypted, the environment variable RCLONE_CONFIG_PASS is used to avoid having to re-type the configuration password for every action.
(helix)$ module load rclone
(helix)$ read -rs RCLONE_CONFIG_PASS   # Here you should enter the rclone configuration password you set up earlier.
(helix)$ export RCLONE_CONFIG_PASS
(helix)$ rclone mkdir box:bam_files
(helix)$ rclone copy ENCFF034UET.bam box:bam_files
(helix)$ rclone copy --progress gcat_set_053.bam box:bam_files
Transferred:   5.942G / 5.942 GBytes, 100%, 46.246 MBytes/s, ETA 0s
Errors:        0
Transferred:   1 / 1, 100%
Elapsed time:  2m11.5s
(helix)$ rclone sync --progress --update 0_fast5 box:0_fast5   # interrupted with Ctrl-C
Transferred:   2.208G / 4.230 GBytes, 52%, 12.301 MBytes/s, ETA 2m48s
Errors:        0
Checks:        0 / 0, -
Transferred:   300 / 8001, 4%
Elapsed time:  3m3.7s
...
(helix)$ rclone sync --progress --update 0_fast5 box:0_fast5   # resumed the sync
...
(helix)$ rclone ls box:
    903112 biowulf-2000.png
9287173845 bam_files/ENCFF034UET.bam
2278307840 0_fast5/3_test_data.tar
...
(helix)$ rclone ls --max-depth=1 box:
    903112 biowulf-2000.png
(helix)$ rclone lsd box:
-1 2019-09-03 09:07:55 -1 0_fast5
-1 2019-09-03 08:56:47 -1 bam_files
(helix)$ rclone delete box:0_fast5
(helix)$ rclone lsd box:
-1 2019-09-03 08:56:47 -1 bam_files
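Because an interrupted sync can simply be re-run to pick up where it stopped, as in the session above, a small retry wrapper may be convenient for long transfers. This is a generic shell sketch, not part of rclone; the function name `retry` and its arguments are assumptions for illustration.

```shell
# Hypothetical helper: run a command up to $1 times, pausing $2 seconds
# between attempts, and stop at the first success.
# Usage: retry ATTEMPTS PAUSE_SECONDS command [args...]
retry() {
    retry_attempts=$1
    retry_pause=$2
    shift 2
    retry_n=1
    while ! "$@"; do
        if [ "$retry_n" -ge "$retry_attempts" ]; then
            echo "giving up after $retry_n attempts: $*" >&2
            return 1
        fi
        retry_n=$((retry_n + 1))
        sleep "$retry_pause"
    done
}
```

With the remote from this example, `retry 3 30 rclone sync --update 0_fast5 box:0_fast5` would try the sync up to three times, waiting 30 seconds after each failure.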
Transferring large numbers of small files results in much lower transfer rates. This can be improved somewhat by increasing the number of concurrent transfers and checkers with --checkers 128 --transfers 128, but a better solution is to upload a tar file instead. This can be done without creating a tar file on disk:
(helix)$ tar -cz directory | rclone -P rcat box:directory.tar.gz
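The same streaming idea also works in the other direction, since `rclone cat` writes a remote file to standard output. The sketch below wraps both directions in two small functions; the names `stream_upload` and `stream_download` are made up for illustration, and `-f -` is spelled out so that tar reads and writes the pipe regardless of its built-in default archive.

```shell
# Hypothetical helpers for streaming a directory to and from a remote
# without creating a tar file on local disk.

# $1 = local directory, $2 = remote path such as box:directory.tar.gz
stream_upload() {
    tar -czf - "$1" | rclone rcat "$2"
}

# $1 = remote path such as box:directory.tar.gz
# $2 = existing local directory to unpack into
stream_download() {
    rclone cat "$1" | tar -xzf - -C "$2"
}
```

For example, `stream_upload directory box:directory.tar.gz` followed later by `stream_download box:directory.tar.gz /path/to/restore` (the restore path is a placeholder) would round-trip the directory.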
Files shared with users via their NIH Box account are available for copy/sync/... in the same way as files uploaded by users themselves. For example
(helix)$ rclone lsd box:
-1 2019-12-04 12:00:02 -1 TYCTWD   ### <-- this folder was shared with me
-1 2019-11-25 08:02:28 -1 bam_files
Using rclone to access OneDrive requires authorization. Please contact HPC staff at staff@hpc.nih.gov to be added to the authorized group.
rclone can be configured to access NIH OneDrive following the same workflow as above for NIH Box. Here is the rclone configuration session:
[user@helix]$ rclone config
Current remotes:
Name                 Type
====                 ====
box                  box
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> n
name> onedrive
Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
30 / Microsoft Azure Blob Storage
   \ (azureblob)
31 / Microsoft OneDrive
   \ (onedrive)
32 / OpenDrive
   \ (opendrive)
...
Storage> onedrive
** See help for onedrive backend at: https://rclone.org/onedrive/ **
Option client_id.
OAuth Client Id.
Leave blank normally.
Enter a value. Press Enter to leave empty.
client_id>
Option client_secret.
OAuth Client Secret.
Leave blank normally.
Enter a value. Press Enter to leave empty.
client_secret>
Option region.
Choose national cloud region for OneDrive.
Choose a number from below, or type in your own string value.
Press Enter for the default (global).
 1 / Microsoft Cloud Global
   \ (global)
 2 / Microsoft Cloud for US Government
   \ (us)
 3 / Microsoft Cloud Germany
   \ (de)
 4 / Azure and Office 365 operated by Vnet Group in China
   \ (cn)
region> 1
Edit advanced config? (y/n)
y) Yes
n) No
y/n> n
Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes
n) No
y/n> n
For this to work, you will need rclone available on a machine that has a web browser available.
For more help and alternate methods see: https://rclone.org/remote_setup/
Execute the following on your machine (same rclone version recommended) :
    rclone authorize "onedrive"
Then paste the result below:
Enter a value.
config_token>
At this point, use the same approach as above with your local rclone installation to obtain an authorization token and paste it into the config prompt:
config_token> {"access_token":"...","expiry":"..."}
Option config_type.
Type of connection
Choose a number from below, or type in an existing value
 1 / OneDrive Personal or Business
   \ "onedrive"
 2 / Root Sharepoint site
   \ "sharepoint"
 3 / Type in driveID
   \ "driveid"
 4 / Type in SiteID
   \ "siteid"
 5 / Search a Sharepoint site
   \ "search"
Your choice> 1
Option config_driveid.
Select drive you want to use
Choose a number from below, or type in your own string value.
Press Enter for the default (b!uNjwdhQxy0aQz-OAHwa4CnJfGNEzxMpAkUmTvdBYoadkfRGG_wxBQbmo4JlG9JAp).
 1 / OneDrive (business)
   \ (...)
config_driveid>1
Drive OK?
Found drive "root" of type "business"
URL: https://nih-my.sharepoint.com/personal/user_nih_gov/Documents
y) Yes (default)
n) No
y/n> y
Configuration complete.
Options:
- type: onedrive
- region: global
- token: {"access_token":"...","expiry":"..."}
- drive_id: ...
- drive_type: business
Keep this "onedrive" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d>y
Current remotes:
Name                 Type
====                 ====
box                  box
onedrive             onedrive
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
If the rclone destination limits the size of an individual file (e.g. Box and OneDrive above) and you need to transfer a file larger than that limit, you can do so with rclone's transparent chunking overlay.
You will first need to set up the basic Box or OneDrive remote as described above. Next, you will add a second, wrapper remote using the "chunker" remote type. When using this remote, any files underneath the maximum size will be transferred normally, but larger files will be split into chunks.
An example for the Box remote set up above, using a maximum chunk size of 10 GB:
[user@helix]$ rclone config
Current remotes:
Name                 Type
====                 ====
box                  box
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> n
name> box_chunked
Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / 1Fichier
   \ "fichier"
...
29 / Transparently chunk/split large files
   \ "chunker"
...
33 / http Connection
   \ "http"
34 / premiumize.me
   \ "premiumizeme"
Storage> 29
** See help for chunker backend at: https://rclone.org/chunker/ **
Remote to chunk/unchunk.
Normally should contain a ':' and a path, eg "myremote:path/to/dir",
"myremote:bucket" or maybe "myremote:" (not recommended).
Enter a string value. Press Enter for the default ("").
remote> box:
Files larger than chunk size will be split in chunks.
Enter a size with suffix k,M,G,T. Press Enter for the default ("2G").
chunk_size> 10G
Choose how chunker handles hash sums. All modes but "none" require metadata.
Enter a string value. Press Enter for the default ("md5").
Choose a number from below, or type in your own value
 1 / Pass any hash supported by wrapped remote for non-chunked files, return nothing otherwise
   \ "none"
 2 / MD5 for composite files
   \ "md5"
 3 / SHA1 for composite files
   \ "sha1"
 4 / MD5 for all files
   \ "md5all"
 5 / SHA1 for all files
   \ "sha1all"
 6 / Copying a file to chunker will request MD5 from the source falling back to SHA1 if unsupported
   \ "md5quick"
 7 / Similar to "md5quick" but prefers SHA1 over MD5
   \ "sha1quick"
hash_type> 1
Edit advanced config? (y/n)
y) Yes
n) No
y/n> n
Remote config
--------------------
[box_chunked]
type = chunker
remote = box:
chunk_size = 10G
hash_type = none
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:
Name                 Type
====                 ====
box                  box
box_chunked          chunker
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q

If we now upload a 32 GB file to Box (normally too large to upload), it is uploaded as four chunk files:
[user@helix]$ md5sum biggerfile
3fa85ee6755cc326c474bda600863e69 -
[user@helix]$ rclone copy -P biggerfile box_chunked:
Transferred:   32G / 32 GBytes, 100%, 30.765 MBytes/s, ETA 0s
Errors:        0
Checks:        4 / 4, 100%
Transferred:   1 / 1, 100%
Elapsed time:  17m45.1s
[user@helix]$ rclone ls box:
         40 biggerfile
10737418240 biggerfile.rclone_chunk.001
10737418240 biggerfile.rclone_chunk.002
10737418240 biggerfile.rclone_chunk.003
 2147483648 biggerfile.rclone_chunk.004
[user@helix]$ rclone ls box_chunked:
34359738368 biggerfile
When you access this via the Box website or the normal Box rclone remote, you will see a placeholder file as well as the individual chunk files named *.rclone_chunk.###. If you access your Box with the "chunker" remote, it will transparently reassemble the files for you as shown above.
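The number of chunk files to expect follows from a ceiling division of the file size by the chunk size. The sketch below reproduces the numbers from the listing above; the function names `chunk_count` and `last_chunk_size` are hypothetical, and the arithmetic only applies to files larger than the chunk size, since smaller files are transferred without chunking.

```shell
# Hypothetical helper: number of chunks for a file of $1 bytes with a
# chunk size of $2 bytes (ceiling division).
chunk_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical helper: size in bytes of the final chunk.
last_chunk_size() {
    echo $(( $1 - ($(chunk_count "$1" "$2") - 1) * $2 ))
}

# 34359738368-byte file with 10G (10737418240-byte) chunks
chunk_count 34359738368 10737418240
last_chunk_size 34359738368 10737418240
```

The two calls print 4 and 2147483648, matching the four chunk files and the size of biggerfile.rclone_chunk.004 in the listing.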
If you need to access your data without rclone, you can reassemble the files manually. First, download the files from Box however you prefer. Then, concatenate the chunks together with cat:
[user@othermachine]$ ls -la biggerfile*
.rw-r--r-- user user  40 B  Fri Sep 16 11:17:18 2022 biggerfile
.rw-r--r-- user user  10 GB Fri Sep 16 11:17:18 2022 biggerfile.rclone_chunk.001
.rw-r--r-- user user  10 GB Fri Sep 16 11:17:18 2022 biggerfile.rclone_chunk.002
.rw-r--r-- user user  10 GB Fri Sep 16 11:17:18 2022 biggerfile.rclone_chunk.003
.rw-r--r-- user user 2.0 GB Fri Sep 16 11:17:18 2022 biggerfile.rclone_chunk.004
[user@othermachine]$ cat biggerfile
{"ver":1,"size":34359738368,"nchunks":4}
[user@othermachine]$ cat biggerfile.rclone_chunk.* > biggerfile
[user@othermachine]$ md5sum biggerfile
3fa85ee6755cc326c474bda600863e69 -
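The placeholder file records the expected total size, so a manual reassembly can be checked against it before the placeholder is overwritten. The following is a minimal sketch, assuming the placeholder has the JSON layout shown above and that the chunk suffixes sort correctly in glob order; the function name `reassemble_chunks` is hypothetical, and the output must be written under a different name than the placeholder.

```shell
# Hypothetical helper: concatenate chunks and compare the resulting size with
# the "size" field recorded in the chunker placeholder file.
# $1 = placeholder file (e.g. biggerfile)
# $2 = output file (a different name, e.g. biggerfile.joined)
reassemble_chunks() {
    # Pull the digits following "size": out of the placeholder JSON.
    expected=$(sed -n 's/.*"size":\([0-9][0-9]*\).*/\1/p' "$1")
    # The shell expands the glob in sorted order: .001, .002, ...
    cat "$1".rclone_chunk.* > "$2" || return 1
    actual=$(wc -c < "$2" | tr -d ' ')
    if [ "$expected" != "$actual" ]; then
        echo "size mismatch: expected $expected bytes, got $actual" >&2
        return 1
    fi
}
```

For example, `reassemble_chunks biggerfile biggerfile.joined && mv biggerfile.joined biggerfile` replaces the placeholder only after the size check has passed.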