Transferring data between NIH Box or NIH OneDrive and HPC systems

Box is one of the collaboration tools provided by NIH. It can be used for collaboration and file sharing with NIH users as well as users outside the NIH. Files can be transferred directly between your NIH Box account and HPC systems storage.

OneDrive is the Microsoft cloud service included in the NIH Microsoft 365 subscription. Files can be shared within NIH, and all NIH users have access.

Here is a summary of the similarities and differences between Box and OneDrive for NIH users.

Information obtained from NIH Box and OneDrive documentation, accurate as of 15 Sep 2022

Feature | Box | OneDrive
URL | NIH Box | NIH OneDrive
Documentation | CIT-BOX documentation | CIT-OneDrive documentation
Accounts | Need to apply for an account via the online form | All NIH personnel have automatic access
Space limit | Unlimited | Initial allocation 1 TB, can request up to 5 TB
Individual file size limit | 15 GB (note: rclone can chunk files) | 250 GB (note: rclone can chunk files)
Data sharing | Can share outside NIH | Can only share within NIH
Access | Can access from anywhere (NIH network or VPN not required) | Must be on NIH network or VPN to access
Transfer to/from Helix/Biowulf | via rclone (Globus connector coming...) | via rclone or Globus (note: individual file size limit is 100 GB for Globus)
Sample transfer speeds from Helix using rclone | 24 MB/s (1 GB file - 40 seconds); 35 MB/s (10 GB file - 5 mins) | 9.5 MB/s (1 GB file - 1.6 mins); 9.3 MB/s (10 GB file - 17 mins)
Access/download trail | Can track whether data has been accessed or downloaded | ?
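
As a rough illustration of how the sample figures relate file size to elapsed time, the sketch below estimates a transfer time from a size and an average rate. The function name estimate_seconds is hypothetical, and the calculation assumes a constant rate and decimal units (1 GB = 1000 MB), which real transfers will not match exactly.

```shell
# Hypothetical helper: estimate transfer time in whole seconds.
# Arguments: file size in MB, average rate in MB/s (both integers).
estimate_seconds() {
    size_mb=$1
    rate_mb_per_s=$2
    # Integer division, rounded up so a partial second counts as a full one.
    echo $(( (size_mb + rate_mb_per_s - 1) / rate_mb_per_s ))
}

# A 10 GB file at the sample Box rate of 35 MB/s
estimate_seconds 10000 35   # prints 286 (a little under 5 minutes)
```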

The transfer speeds above are samples; you may get higher or lower transfer speeds.

Transferring data between NIH Box and HPC storage with rclone

Configuring rclone to access NIH Box

rclone has to be configured on the HPC systems before it can connect to NIH Box. Here is an outline of the steps needed:

[Figure: rclone setup flowchart]

During this process a token has to be copied and pasted. Some tokens are too long and result in an error during configuration or token renewal. If that happens, you can carry out the entire configuration on your local system and then copy the resulting config file (or just the relevant section of it) from your local system to ~/.config/rclone/rclone.conf on the HPC system.
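
If only one remote needs to be moved, its section can be pulled out of the local config file before copying it over. The following is a minimal sketch, assuming an unencrypted config file in rclone's INI-style layout (a [name] header followed by key = value lines); the function name extract_section and the demo file contents are hypothetical.

```shell
# Hypothetical helper: print one [section] of an unencrypted rclone.conf.
# Arguments: remote name, path to the config file.
extract_section() {
    awk -v name="$1" '
        # Every header line either starts the wanted section or ends it.
        /^\[/ { in_section = ($0 == ("[" name "]")) }
        in_section { print }
    ' "$2"
}

# Demo with a throwaway config holding two remotes (tokens elided).
demo_conf=$(mktemp)
cat > "$demo_conf" <<'EOF'
[box]
type = box
box_sub_type = user
token = {"access_token":"...","token_type":"bearer","refresh_token":"...","expiry":"..."}

[onedrive]
type = onedrive
region = global
EOF

extract_section box "$demo_conf"
rm -f "$demo_conf"
```

The printed section could then be appended to ~/.config/rclone/rclone.conf on the HPC system. An encrypted config file cannot be split this way and has to be copied whole.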

Install rclone on your own computer

Instructions and downloads can be found on the rclone download page.

Connect to helix with your ssh client and start configuration session

Connect to helix.nih.gov as usual and begin the rclone configuration step. The configuration for rclone is stored in ~/.config/rclone/rclone.conf and can optionally be encrypted. This can be done when first setting up the configuration or later in a new rclone config session. If you forget your encryption password, you will have to delete your rclone.conf file and start over.

In the session below, the answers to the interactive prompts are the values shown after each > prompt.

(helix)$ module load rclone
(helix)$ rclone config
2019/09/03 09:20:27 NOTICE: Config file "/home/user/.config/rclone/rclone.conf" not found - using defaults
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> s
Your configuration is not encrypted.
If you add a password, you will protect your login information to cloud services.
a) Add Password
q) Quit to main menu
a/q> a
Enter NEW configuration password:
password: ******
Confirm NEW configuration password:
password: ******
Password set
Your configuration is encrypted.
c) Change Password
u) Unencrypt configuration
q) Quit to main menu
c/u/q> q
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> box
Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
  1 / 1Fichier
   \ (fichier)
 2 / Akamai NetStorage
   \ (netstorage)
 3 / Alias for an existing remote
   \ (alias)
 4 / Amazon Drive
   \ (amazon cloud drive)
 5 / Amazon S3 Compliant Storage Providers including AWS, Alibaba, Ceph, China Mobile, Cloudflare, ArvanCloud, DigitalOcean, Dreamhost, Huawei OBS, IBM COS, IDrive e2, IONOS Cloud, Liara, Lyve Cloud, Minio, Netease, RackCorp, Scaleway, SeaweedFS, StackPath, Storj, Tencent COS, Qiniu and Wasabi
   \ (s3)
 6 / Backblaze B2
   \ (b2)
 7 / Better checksums for other remotes
   \ (hasher)
 8 / Box
   \ (box)
 ...
 Storage> box
** See help for box backend at: https://rclone.org/box/ **

Option client_id.
OAuth Client Id.
Leave blank normally.
Enter a value. Press Enter to leave empty.
client_id>

Option client_secret.
OAuth Client Secret.
Leave blank normally.
Enter a value. Press Enter to leave empty.
client_secret>

Option box_config_file.
Box App config.json location
Leave blank normally.
Leading `~` will be expanded in the file name as will environment variables such as `${RCLONE_CONFIG_DIR}`.
Enter a value. Press Enter to leave empty.
box_config_file>

Option access_token.
Box App Primary Access Token
Leave blank normally.
Enter a value. Press Enter to leave empty.
access_token>

Option box_sub_type.
Choose a number from below, or type in your own string value.
Press Enter for the default (user).
 1 / Rclone should act on behalf of a user.
   \ (user)
 2 / Rclone should act on behalf of a service account.
   \ (enterprise)
box_sub_type> 1

Edit advanced config?
y) Yes
n) No (default)
y/n> n

Use web browser to automatically authenticate rclone with remote?
 * Say Y if the machine running rclone has a web browser you can use
 * Say N if running rclone on a (remote) machine without web browser access
If not sure try Y. If Y failed, try N.

y) Yes (default)
n) No
y/n> n

Option config_token.
For this to work, you will need rclone available on a machine that has
a web browser available.
For more help and alternate methods see: https://rclone.org/remote_setup/
Execute the following on the machine with the web browser (same rclone
version recommended):
	rclone authorize "box"
Then paste the result.
Enter a value.
config_token> 

Authorize on your local machine and obtain an access token

On your local computer, run rclone authorize box. For example, on a Windows system with PowerShell:

PS C:\Users\user\...\rclone-v1.58.1-windows-amd64> .\rclone.exe authorize box
2022/06/15 15:42:25 NOTICE: Config file "C:\\Users\\user\\AppData\\Roaming\\rclone\\rclone.conf" not found - using defaults
2022/06/15 15:42:25 NOTICE: If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth?state=...
2022/06/15 15:42:25 NOTICE: Log in and authorize rclone for access
2022/06/15 15:42:25 NOTICE: Waiting for code...

Note that on a Mac you may need an administrator to install and use rclone.

At this point rclone will open a browser window:

rclone/box authorization step 1

After choosing single sign on you will be prompted to enter your email address. Please use your @mail.nih.gov or @nih.gov address.

rclone/box authorization step 2

You may be redirected to an intermediate Microsoft sign-in page. Either enter your email address again or, if it is already listed, select it from the available options.

rclone/box authorization step 2b

Then you should see an NIH login page where you will have to use NIH\username with your username:

rclone/box authorization step 3

Once you log in, Box will ask whether it should grant access to rclone. Please note that this token will expire and will need to be renewed.

rclone/box authorization step 4

You should see a success message in the browser

rclone/box authorization step 4

and the rclone authorize command in the PowerShell window (or macOS terminal) will resume and show something like this:

2022/06/15 15:43:17 NOTICE: Got code
Paste the following into your remote machine --->
{"access_token":"...","token_type":"bearer","refresh_token":"...","expiry":"..."}
<---End paste

Copy the code and paste it into the rclone config prompt on helix


> {"access_token":"...","token_type":"bearer","refresh_token":"...","expiry":"..."}
--------------------
Configuration complete.
Options:
type = box
box_sub_type = user
token = {"access_token":"xxxx","token_type":"bearer","refresh_token":"xxxx","expiry":"..."}
Keep this "box" remote?
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:

Name                 Type
====                 ====
box                  box

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q

Using rclone to transfer files

A simple example of rclone usage is shown below. Please see the official rclone documentation for more details. Since the configuration in this example is encrypted, the environment variable RCLONE_CONFIG_PASS is used to avoid having to re-type the configuration password for every action.

(helix)$ module load rclone
(helix)$ read -rs RCLONE_CONFIG_PASS
      #Here you should enter the rclone configuration password you set up earlier.
(helix)$ export RCLONE_CONFIG_PASS

(helix)$ rclone mkdir box:bam_files
(helix)$ rclone copy ENCFF034UET.bam box:bam_files
(helix)$ rclone copy --progress gcat_set_053.bam  box:bam_files
Transferred:        5.942G / 5.942 GBytes, 100%, 46.246 MBytes/s, ETA 0s
Errors:                 0
Transferred:            1 / 1, 100%
Elapsed time:     2m11.5s

(helix)$ rclone sync --progress --update 0_fast5 box:0_fast5  # interrupted with Ctrl-C
Transferred:        2.208G / 4.230 GBytes, 52%, 12.301 MBytes/s, ETA 2m48s
Errors:                 0
Checks:                 0 / 0, -
Transferred:          300 / 8001, 4%
Elapsed time:      3m3.7s
...

(helix)$ rclone sync --progress --update 0_fast5 box:0_fast5  # resumed the sync
...

(helix)$ rclone ls box:
   903112 biowulf-2000.png
9287173845 bam_files/ENCFF034UET.bam
2278307840 0_fast5/3_test_data.tar
...
(helix)$ rclone ls --max-depth=1 box:
   903112 biowulf-2000.png
(helix)$ rclone lsd box:
          -1 2019-09-03 09:07:55        -1 0_fast5
          -1 2019-09-03 08:56:47        -1 bam_files
(helix)$ rclone purge box:0_fast5  # purge removes the directory and its contents; delete removes only the files
(helix)$ rclone lsd box:
          -1 2019-09-03 08:56:47        -1 bam_files

Transferring large numbers of small files is slow, with much lower transfer rates than for a few large files. This can be improved somewhat by increasing the number of concurrent transfers and checkers with --checkers 128 --transfers 128, but a better solution is to upload a tar file instead. This can be done without creating a tar file on disk:

(helix)$ tar -cz directory | rclone -P rcat box:directory.tar.gz
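
The choice between a per-file copy and a streamed archive can be sketched as a small helper that counts the files in a directory and prints (without running) the matching command. The function name suggest_upload and the default threshold of 1000 files are hypothetical and only illustrate the idea; the right cutoff depends on file sizes and current transfer conditions.

```shell
# Hypothetical helper: print (not run) an upload command for a directory.
# Arguments: directory (relative path), remote name, file-count threshold
# (defaults to 1000, an arbitrary value chosen for illustration).
suggest_upload() {
    dir=$1
    remote=$2
    threshold=${3:-1000}
    # Count regular files; tr strips padding that some wc versions add.
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    if [ "$count" -gt "$threshold" ]; then
        echo "tar -cz $dir | rclone -P rcat $remote:$dir.tar.gz"
    else
        echo "rclone copy --progress $dir $remote:$dir"
    fi
}

# Demo: a scratch directory with 5 small files and a threshold of 3.
demo_parent=$(mktemp -d)
mkdir "$demo_parent/0_fast5"
for i in 1 2 3 4 5; do
    echo "$i" > "$demo_parent/0_fast5/read_$i.txt"
done
(cd "$demo_parent" && suggest_upload 0_fast5 box 3)
rm -rf "$demo_parent"
```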

Shared data

Files shared with users via their NIH Box account are available for copy, sync, and other operations in the same way as files uploaded by the users themselves. For example:

(helix)$ rclone lsd box:
      -1 2019-12-04 12:00:02        -1 TYCTWD              ### <-- this folder was shared with me
      -1 2019-11-25 08:02:28        -1 bam_files

Transferring data between NIH OneDrive and HPC storage with rclone

Configuring rclone to access OneDrive

Using rclone to access OneDrive requires authorization. Please contact the HPC staff at staff@hpc.nih.gov to be added to the authorized group.

rclone can be configured to access NIH OneDrive following the same workflow as above for NIH Box. Here is the rclone configuration session:

[user@helix]$ rclone config
Current remotes:

Name                 Type
====                 ====
box                  box

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> n
name> onedrive
Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
30 / Microsoft Azure Blob Storage
   \ (azureblob)
31 / Microsoft OneDrive
   \ (onedrive)
32 / OpenDrive
   \ (opendrive)
...
Storage> onedrive
** See help for onedrive backend at: https://rclone.org/onedrive/ **

Option client_id.
OAuth Client Id.
Leave blank normally.
Enter a value. Press Enter to leave empty.
client_id>

Option client_secret.
OAuth Client Secret.
Leave blank normally.
Enter a value. Press Enter to leave empty.
client_secret>

Option region.
Choose national cloud region for OneDrive.
Choose a number from below, or type in your own string value.
Press Enter for the default (global).
 1 / Microsoft Cloud Global
   \ (global)
 2 / Microsoft Cloud for US Government
   \ (us)
 3 / Microsoft Cloud Germany
   \ (de)
 4 / Azure and Office 365 operated by Vnet Group in China
   \ (cn)
region> 1
Edit advanced config? (y/n)
y) Yes
n) No
y/n> n
Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes
n) No
y/n> n
For this to work, you will need rclone available on a machine that has
a web browser available.
For more help and alternate methods see: https://rclone.org/remote_setup/
Execute the following on your machine (same rclone version recommended) :
        rclone authorize "onedrive"
Then paste the result below:
Enter a value.
config_token>

At this point, use the same approach as above with your local rclone installation to obtain an authorization token and paste it into the config prompt:

config_token> {"access_token":"...","expiry":"..."}
Option config_type.
Type of connection
Choose a number from below, or type in an existing value
 1 / OneDrive Personal or Business
   \ "onedrive"
 2 / Root Sharepoint site
   \ "sharepoint"
 3 / Type in driveID
   \ "driveid"
 4 / Type in SiteID
   \ "siteid"
 5 / Search a Sharepoint site
   \ "search"
Your choice> 1

Option config_driveid.
Select drive you want to use
Choose a number from below, or type in your own string value.
Press Enter for the default (b!uNjwdhQxy0aQz-OAHwa4CnJfGNEzxMpAkUmTvdBYoadkfRGG_wxBQbmo4JlG9JAp).
 1 / OneDrive (business)
 \ (...)
config_driveid>1

Drive OK?

Found drive "root" of type "business"
URL: https://nih-my.sharepoint.com/personal/user_nih_gov/Documents

y) Yes (default)
n) No
y/n> y

Configuration complete.
Options:
- type: onedrive
- region: global
- token: {"access_token":"...","expiry":"..."}

- drive_id: ...
- drive_type: business
Keep this "onedrive" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d>y

Current remotes:

Name                 Type
====                 ====
box                  box
onedrive             onedrive

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q



How to 'chunk' files via rclone

If the rclone destination has a limit on individual file size (e.g. Box and OneDrive above), and you need to transfer a file larger than the limit, you can do so with rclone's transparent "chunker" overlay.
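
Whether a given file needs this treatment can be checked before uploading. The following is a minimal sketch, assuming the limit is supplied in bytes; the function name exceeds_limit is hypothetical, and the demo uses tiny values in place of the real limits.

```shell
# Hypothetical helper: succeed if a file is larger than a per-file limit.
# Arguments: path to the file, limit in bytes.
exceeds_limit() {
    # wc -c reports the size in bytes; tr strips any padding.
    file_bytes=$(wc -c < "$1" | tr -d ' ')
    [ "$file_bytes" -gt "$2" ]
}

# Demo: a 2048-byte scratch file against a 1024-byte limit, standing in
# for the much larger real limits (15 GB for Box, 250 GB for OneDrive).
demo_file=$(mktemp)
head -c 2048 /dev/zero > "$demo_file"
if exceeds_limit "$demo_file" 1024; then
    echo "larger than the limit: upload through a chunker remote"
else
    echo "within the limit: the plain remote is sufficient"
fi
rm -f "$demo_file"
```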

You will first need to set up the basic Box or OneDrive remote as described above. Next, you will add a second, wrapper remote using the "chunker" remote type. When using this remote, any files underneath the maximum size will be transferred normally, but larger files will be split into chunks.

An example for the Box remote set up above, splitting files into chunks of at most 10 GB:

[user@helix]$ rclone config
Current remotes:

Name                 Type
====                 ====
box                  box

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> n
name> box_chunked
Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / 1Fichier
   \ "fichier"
...
29 / Transparently chunk/split large files
   \ "chunker"
...
33 / http Connection
   \ "http"
34 / premiumize.me
   \ "premiumizeme"
Storage> 29
** See help for chunker backend at: https://rclone.org/chunker/ **
Remote to chunk/unchunk.
Normally should contain a ':' and a path, eg "myremote:path/to/dir",
"myremote:bucket" or maybe "myremote:" (not recommended).
Enter a string value. Press Enter for the default ("").
remote> box:
Files larger than chunk size will be split in chunks.
Enter a size with suffix k,M,G,T. Press Enter for the default ("2G").
chunk_size> 10G
Choose how chunker handles hash sums. All modes but "none" require metadata.
Enter a string value. Press Enter for the default ("md5").
Choose a number from below, or type in your own value
 1 / Pass any hash supported by wrapped remote for non-chunked files, return nothing otherwise
   \ "none"
 2 / MD5 for composite files
   \ "md5"
 3 / SHA1 for composite files
   \ "sha1"
 4 / MD5 for all files
   \ "md5all"
 5 / SHA1 for all files
   \ "sha1all"
 6 / Copying a file to chunker will request MD5 from the source falling back to SHA1 if unsupported
   \ "md5quick"
 7 / Similar to "md5quick" but prefers SHA1 over MD5
   \ "sha1quick"
hash_type> 1
Edit advanced config? (y/n)
y) Yes
n) No
y/n> n
Remote config
--------------------
[box_chunked]
type = chunker
remote = box:
chunk_size = 10G
hash_type = none
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:

Name                 Type
====                 ====
box                  box
box_chunked          chunker

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q

If we now upload a 32 GB file to Box (normally too large to upload), four chunk files are uploaded:

[user@helix]$ md5sum biggerfile
3fa85ee6755cc326c474bda600863e69  biggerfile

[user@helix]$ rclone copy -P biggerfile box_chunked:
Transferred:           32G / 32 GBytes, 100%, 30.765 MBytes/s, ETA 0s
Errors:                 0
Checks:                 4 / 4, 100%
Transferred:            1 / 1, 100%
Elapsed time:    17m45.1s

[user@helix]$ rclone ls box:
       40 biggerfile
10737418240 biggerfile.rclone_chunk.001
10737418240 biggerfile.rclone_chunk.002
10737418240 biggerfile.rclone_chunk.003
2147483648 biggerfile.rclone_chunk.004

[user@helix]$ rclone ls box_chunked:
34359738368 biggerfile

When you access this via the Box website or the normal Box rclone remote, you will see a placeholder file as well as the individual chunk files named *.rclone_chunk.###. If you access your Box with the "chunker" remote, it will transparently reassemble the files for you as shown above.
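
The chunk sizes in the listing above follow from simple arithmetic on the file size and the configured chunk size. A minimal sketch, with the hypothetical function name chunk_layout and sizes given in bytes (assumed to be greater than zero):

```shell
# Hypothetical helper: number of chunks and size of the final chunk.
# Arguments: file size in bytes, chunk size in bytes (both > 0).
chunk_layout() {
    size=$1
    chunk=$2
    count=$(( (size + chunk - 1) / chunk ))   # divide, rounding up
    last=$(( size - (count - 1) * chunk ))    # what is left for the final chunk
    echo "$count chunks, last chunk $last bytes"
}

# The 34359738368-byte file and 10G chunk size from the example above
chunk_layout 34359738368 10737418240
# prints: 4 chunks, last chunk 2147483648 bytes
```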

If you need to access your data without rclone, you can reassemble the files manually. First, download the files from Box however you prefer. Then, concatenate the chunks together with cat:

[user@othermachine]$ ls -la biggerfile*
.rw-r--r-- user user  40 B  Fri Sep 16 11:17:18 2022  biggerfile
.rw-r--r-- user user  10 GB Fri Sep 16 11:17:18 2022  biggerfile.rclone_chunk.001
.rw-r--r-- user user  10 GB Fri Sep 16 11:17:18 2022  biggerfile.rclone_chunk.002
.rw-r--r-- user user  10 GB Fri Sep 16 11:17:18 2022  biggerfile.rclone_chunk.003
.rw-r--r-- user user 2.0 GB Fri Sep 16 11:17:18 2022  biggerfile.rclone_chunk.004

[user@othermachine]$ cat biggerfile
{"ver":1,"size":34359738368,"nchunks":4}

[user@othermachine]$ cat biggerfile.rclone_chunk.* > biggerfile

[user@othermachine]$ md5sum biggerfile
3fa85ee6755cc326c474bda600863e69  biggerfile
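
The same reassembly can be rehearsed on a small scale before relying on it for real data. The sketch below is a local dry run that does not involve rclone at all: it assumes GNU coreutils (split, md5sum), uses split to produce pieces named like rclone's chunk files, and checks that concatenating them restores the original checksum. All file names and sizes are stand-ins.

```shell
# Local dry run of manual reassembly (assumes GNU coreutils).
workdir=$(mktemp -d)

# Create a 2500-byte stand-in for the original file and record its checksum.
head -c 2500 /dev/urandom > "$workdir/original"
before=$(md5sum < "$workdir/original" | cut -d' ' -f1)

# Split into 1000-byte pieces with suffixes .001, .002, .003,
# mimicking the naming of the chunk files.
split -b 1000 -a 3 --numeric-suffixes=1 \
    "$workdir/original" "$workdir/biggerfile.rclone_chunk."

# Reassemble. The shell expands the glob in sorted order, and the
# zero-padded suffixes keep the chunks in sequence.
cat "$workdir"/biggerfile.rclone_chunk.* > "$workdir/biggerfile"
after=$(md5sum < "$workdir/biggerfile" | cut -d' ' -f1)

if [ "$before" = "$after" ]; then
    echo "checksums match"
else
    echo "checksum mismatch"
fi
rm -rf "$workdir"
```

Note that redirecting the concatenated chunks to biggerfile overwrites the small placeholder file of the same name, so keep a copy of it if the recorded size and chunk count are still needed.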