Biowulf High Performance Computing at the NIH
Transferring data between NIH Box or NIH OneDrive and HPC systems

Box is one of the collaboration tools provided by NIH. It can be used for collaboration and file sharing with NIH users as well as users outside the NIH. Files can be transferred directly between your NIH Box account and HPC storage.

OneDrive is the Microsoft cloud storage service included in the NIH Microsoft 365 subscription. Files can be shared within NIH, and all NIH users have access.

Transferring data between NIH Box and HPC storage with rclone

Configuring rclone to access NIH Box

To use rclone on the HPC systems to access your NIH Box account, some configuration is required. During configuration, rclone opens a browser window on the system where the configuration is run, so this is easiest to do from within a NoMachine NX session on helix.

The configuration for rclone is stored in ~/.config/rclone/rclone.conf and can optionally be encrypted. This can be done when first setting up the configuration or later in a new rclone config session.

If you forget your encryption password, you will have to delete your rclone.conf file and start over.
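
Before entering a password, it can help to know whether the configuration file is actually encrypted, and a forgotten password does not have to mean deleting the file outright. The sketch below assumes the default path given above and assumes that an encrypted configuration contains a line starting with RCLONE_ENCRYPT_V0: (the marker rclone writes at the time of writing). The function and variable names are hypothetical helpers, not part of rclone.

```shell
# Hypothetical helpers; the default path is the one mentioned above.
DEFAULT_RCLONE_CONF="${HOME}/.config/rclone/rclone.conf"

# Succeeds if the given config file (default: $DEFAULT_RCLONE_CONF) looks encrypted.
# Assumption: encrypted configs contain a line starting with "RCLONE_ENCRYPT_V0:".
rclone_config_is_encrypted() {
    grep -q '^RCLONE_ENCRYPT_V0:' "${1:-$DEFAULT_RCLONE_CONF}"
}

# Move the config aside instead of deleting it; afterwards, run "rclone config"
# to start over. The timestamped backup keeps the old file around in case the
# password is remembered later.
set_aside_rclone_config() {
    conf="${1:-$DEFAULT_RCLONE_CONF}"
    mv -- "$conf" "$conf.bak.$(date +%Y%m%d%H%M%S)"
}
```

After set_aside_rclone_config, rclone config reports that no configuration file was found and the setup can be repeated from the beginning.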

Run the following in a terminal in the NX session on helix. The answers to the interactive prompts are the values typed after each ">" prompt.

(helix)$ module load rclone
(helix)$ rclone config
2019/09/03 09:20:27 NOTICE: Config file "/home/user/.config/rclone/rclone.conf" not found - using defaults
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> s
Your configuration is not encrypted.
If you add a password, you will protect your login information to cloud services.
a) Add Password
q) Quit to main menu
a/q> a
Enter NEW configuration password:
password: ******
Confirm NEW configuration password:
password: ******
Password set
Your configuration is encrypted.
c) Change Password
u) Unencrypt configuration
q) Quit to main menu
c/u/q> q
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> box
Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / A stackable unification remote, which can appear to merge the contents of several remotes
   \ "union"
 2 / Alias for a existing remote
   \ "alias"
 3 / Amazon Drive
   \ "amazon cloud drive"
 4 / Amazon S3 Compliant Storage Provider (AWS, Alibaba, Ceph, Digital Ocean, Dreamhost, IBM COS, Minio, etc)
   \ "s3"
 5 / Backblaze B2
   \ "b2"
 6 / Box
   \ "box"
 7 / Cache a remote
   \ "cache"
 8 / Dropbox
 ...
 Storage> 6
** See help for box backend at: https://rclone.org/box/ **

Box App Client Id.
Leave blank normally.
Enter a string value. Press Enter for the default ("").
client_id>
Box App Client Secret
Leave blank normally.
Enter a string value. Press Enter for the default ("").
client_secret>
Edit advanced config? (y/n)
y) Yes
n) No
y/n> n
Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes
n) No
y/n> y
If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth
Log in and authorize rclone for access
Waiting for code...

At this point rclone will open a browser window:

rclone/box authorization step 1

After choosing single sign-on, you will be prompted to enter your email address. Please use your @mail.nih.gov or @nih.gov address.

rclone/box authorization step 2

You may be redirected to an intermediate Microsoft sign-in page. Either enter your email address again or select it from the available options if it is already listed.

rclone/box authorization step 2b

Then you should see an NIH login page, where you log in as NIH\username, replacing username with your own:

rclone/box authorization step 3

Once you log in, Box will ask whether it should grant access to rclone. Please note that the resulting token will expire and needs to be renewed.

rclone/box authorization step 4

You should see a success message in the browser, and the rclone config session will resume in the terminal:

Got code
--------------------
[box]
type = box
token = {"access_token":"xxxx","token_type":"bearer","refresh_token":"xxxx","expiry":"..."}
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:

Name                 Type
====                 ====
box                  box

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
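
Once the remote is saved, a quick check confirms that it is present, and the token can be renewed without repeating the whole configuration. This is a minimal sketch: have_remote and reauthorize_remote are hypothetical wrapper names, and rclone config reconnect is assumed to be available in the installed rclone version (older releases may lack it). With an encrypted configuration, rclone prompts for the password unless RCLONE_CONFIG_PASS is set.

```shell
# Succeeds if a remote with the given name exists in the rclone configuration.
# "rclone listremotes" prints one remote per line with a trailing colon.
have_remote() {
    rclone listremotes | grep -qxF -- "$1:"
}

# Re-run the browser authorization for an existing remote, e.g. after the
# token has expired. Like the initial setup, this needs a browser, so an
# NX session is the easiest place to run it.
reauthorize_remote() {
    rclone config reconnect "$1:"
}
```

For the remote created above, have_remote box should succeed, and reauthorize_remote box repeats only the authorization step.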

Using rclone to transfer files

A simple example of rclone usage is shown below. Please see the official rclone documentation for more details. Since the configuration in this example is encrypted, the environment variable RCLONE_CONFIG_PASS is used to avoid having to re-type the configuration password for every action.

(helix)$ module load rclone
(helix)$ read -rs RCLONE_CONFIG_PASS
      #Here you should enter the rclone configuration password you set up earlier.
(helix)$ export RCLONE_CONFIG_PASS

(helix)$ rclone mkdir box:bam_files
(helix)$ rclone copy ENCFF034UET.bam box:bam_files
(helix)$ rclone copy --progress gcat_set_053.bam  box:bam_files
Transferred:        5.942G / 5.942 GBytes, 100%, 46.246 MBytes/s, ETA 0s
Errors:                 0
Transferred:            1 / 1, 100%
Elapsed time:     2m11.5s

(helix)$ rclone sync --progress --update 0_fast5 box:0_fast5  # interrupted with Ctrl-C
Transferred:        2.208G / 4.230 GBytes, 52%, 12.301 MBytes/s, ETA 2m48s
Errors:                 0
Checks:                 0 / 0, -
Transferred:          300 / 8001, 4%
Elapsed time:      3m3.7s
...

(helix)$ rclone sync --progress --update 0_fast5 box:0_fast5  # resumed the sync
...

(helix)$ rclone ls box:
   903112 biowulf-2000.png
9287173845 bam_files/ENCFF034UET.bam
2278307840 0_fast5/3_test_data.tar
...
(helix)$ rclone ls --max-depth=1 box:
   903112 biowulf-2000.png
(helix)$ rclone lsd box:
          -1 2019-09-03 09:07:55        -1 0_fast5
          -1 2019-09-03 08:56:47        -1 bam_files
(helix)$ rclone delete box:0_fast5
(helix)$ rclone lsd box:
          -1 2019-09-03 08:56:47        -1 bam_files
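
Because rclone sync makes the destination match the source, which can delete files on the destination, a dry run is a cheap safeguard before a real sync. A minimal sketch, where preview_sync is a hypothetical wrapper name and the flags other than --dry-run are the ones used in the example above:

```shell
# Show what a sync would transfer or delete without changing anything.
# $1: local source directory, $2: remote destination such as box:0_fast5
preview_sync() {
    rclone sync --dry-run --update "$1" "$2"
}
```

Calling preview_sync 0_fast5 box:0_fast5 only reports the planned actions; running the same rclone sync command without --dry-run performs them.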

Transferring large numbers of small files is much slower than transferring a few large files. Transfer rates can be improved somewhat by increasing the number of concurrent transfers and checkers with --checkers 128 --transfers 128, but a better solution is to upload a tar file instead. This can be done without creating a tar file on disk:

(helix)$ tar -cz directory | rclone -P rcat box:directory.tar.gz
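
The reverse direction works the same way. The sketch below wraps the upload shown above and adds a matching download that unpacks the archive while streaming it with rclone cat; both function names are hypothetical. It passes -f - to tar explicitly so that the archive goes to standard output, or comes from standard input, regardless of how tar was built.

```shell
# Stream a directory to the remote as one compressed tar archive.
# $1: local directory, $2: remote path such as box:directory.tar.gz
upload_dir_as_tar() {
    tar -czf - -C "$(dirname "$1")" "$(basename "$1")" | rclone rcat "$2"
}

# Stream the archive back and unpack it without storing the tar file on disk.
# $1: remote path, $2: existing local directory to extract into
download_tar_to_dir() {
    rclone cat "$1" | tar -xzf - -C "$2"
}
```

Note that a plain pipeline reports only the exit status of its last command, so a failure in the first command can go unnoticed; in bash, set -o pipefail makes such failures visible.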

Shared data

Files shared with you through your NIH Box account are available for copy, sync, and other operations in the same way as files you uploaded yourself. For example:

(helix)$ rclone lsd box:
      -1 2019-12-04 12:00:02        -1 TYCTWD              ### <-- this folder was shared with me
      -1 2019-11-25 08:02:28        -1 bam_files

Transferring data between NIH OneDrive and HPC storage with rclone

Configuring rclone to access OneDrive

rclone can be configured to access NIH OneDrive following the same workflow as above for NIH Box. Here is the rclone configuration session:

[user@helix]$ rclone config
Current remotes:

Name                 Type
====                 ====
box                  box

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> n
name> onedrive
Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / A stackable unification remote, which can appear to merge the contents of several remotes
   \ "union"
...
16 / Mega
   \ "mega"
17 / Microsoft Azure Blob Storage
   \ "azureblob"
18 / Microsoft OneDrive
   \ "onedrive"
19 / OpenDrive
   \ "opendrive"
...
Storage> 18
** See help for onedrive backend at: https://rclone.org/onedrive/ **

Microsoft App Client Id
Leave blank normally.
Enter a string value. Press Enter for the default ("").
client_id>
Microsoft App Client Secret
Leave blank normally.
Enter a string value. Press Enter for the default ("").
client_secret> 
Edit advanced config? (y/n)
y) Yes
n) No
y/n> n
Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes
n) No
y/n> y

At this point rclone will open a browser on helix, and you will have to sign in to NIH in much the same way as described above for rclone/Box. Make sure to sign in to your @nih.gov Microsoft account.

If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth
Log in and authorize rclone for access
Waiting for code...
Got code
Choose a number from below, or type in an existing value
 1 / OneDrive Personal or Business
   \ "onedrive"
 2 / Root Sharepoint site
   \ "sharepoint"
 3 / Type in driveID
   \ "driveid"
 4 / Type in SiteID
   \ "siteid"
 5 / Search a Sharepoint site
   \ "search"
Your choice> 1
Found 1 drives, please select the one you want to use:
0: OneDrive (business) id=...
Chose drive to use:> 0
Found drive 'root' of type 'business', URL: https://nih-my.sharepoint.com/personal/user_nih_gov/Documents
Is that okay?
y) Yes
n) No
y/n> y
--------------------
[onedrive]
type = onedrive
token = {"access_token":"...","token_type":"Bearer","refresh_token":"...","expiry":"2020-10-16T16:52:39.70344302-04:00"}
drive_id = ...
drive_type = business
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:

Name                 Type
====                 ====
box                  box
onedrive             onedrive

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
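
With the remote saved, OneDrive can be used in the same way as the Box remote, with onedrive: as the prefix. A minimal sketch, assuming the remote was named onedrive as in the session above; the function names and the folder argument are hypothetical:

```shell
# List the top-level folders of the OneDrive remote.
list_onedrive_folders() {
    rclone lsd onedrive:
}

# Copy a local file or directory ($1) into a folder ($2) on the OneDrive remote.
copy_to_onedrive() {
    rclone copy --progress "$1" "onedrive:$2"
}
```

For example, copy_to_onedrive ENCFF034UET.bam bam_files would place the file in a bam_files folder on OneDrive, mirroring the Box example earlier on this page.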