Transferring data between Microsoft Azure Blob Storage and HPC systems

Using azcopy
azcopy is a command-line utility for copying blobs to or from Azure Blob Storage. It is installed on the HPC systems and can be accessed with 'module load azcopy'.

helix% module load azcopy

helix% azcopy login
# azcopy prints a link to open in your browser; enter the code XXXX there to
# authenticate, then sign in with your NIH email.

# Download a single blob to the current directory
helix% azcopy cp "https://[account].blob.core.windows.net/[container]/onefile" .

# Upload a directory recursively
helix% azcopy cp /data/$USER/mydir "https://[account].blob.core.windows.net/[container]/directory" --recursive
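Downloading a directory works the same way with source and destination swapped. A minimal sketch, using hypothetical account/container names; the azcopy line is commented out because it requires a prior 'azcopy login':

```shell
# Hypothetical names -- substitute your own storage account and container
ACCOUNT=myaccount
CONTAINER=mycontainer
SRC="https://${ACCOUNT}.blob.core.windows.net/${CONTAINER}/mydir"
echo "$SRC"    # the blob URL azcopy will read from

# --recursive copies the whole directory tree:
# azcopy cp "$SRC" /data/$USER/mydir --recursive
```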

Using the full Azure CLI

The Azure CLI is installed and available as a module (azure_cli) which can be loaded on Helix or the Biowulf compute nodes.

Sample session:

[user@helix ~]$ module load azure_cli

[user@helix ~]$ az login
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ABCDEFGH to authenticate.

---- After successful login .... ------------------------
[
  {
    "cloudName": "AzureCloud",
    "homeTenantId": "xxxxxxxxxxxxxxxxxxxxxxx",
    "id": "yyyyyyyyyyyyyyyyy",
    "isDefault": true,
    "managedByTenants": [],
    "name": "NIH.Some.Name",
    "state": "Enabled",
    "tenantId": "zzzzzzzzzzzzzzzzzzzzzzzzz",
    "user": {
      "name": "user@nih.gov",
      "type": "user"
    }
  }
]

# get a list of containers in an Azure Blob storage account
[user@helix ~]$ az storage container list --account-name my-azure-storage-account-name --account-key my-account-key
[
  {
    "deleted": null,
    "encryptionScope": {
      "defaultEncryptionScope": "$account-encryption-key",
      "preventEncryptionScopeOverride": false
    },
      [...]
      "lastModified": "2023-08-14T17:06:58+00:00",
      },
      "publicAccess": null
    },
    "version": null
  }
]

Set some environment variables to avoid having to enter them on each CLI command line:
[user@helix ~]$ export AZURE_STORAGE_ACCOUNT=my-azure-storage-account-name
[user@helix ~]$ export AZURE_STORAGE_KEY=my-account-key

List blobs in a container:

[user@helix ~]$ az storage blob list -c container1
[
  {
    "container": "container1",
         [...]
    },
    [...]
    "name": "myfile.jpg",
     [...]
      },
      "copy": {
       [...]
      },
      "creationTime": "2023-08-14T17:07:00+00:00",
      "deletedTime": null,
      "etag": "0x8DB9CE8DFE6341E",
      "lastModified": "2023-08-14T17:07:00+00:00",
      "lease": {
       [...]
      },
      [...]
    },
      [...]
  }
]

Uploading data:

# Upload a file:
# sample speed: 100 GB file in 78 mins
az storage blob upload -c container1   -f ./img_000000255.fits

# Upload a directory:
az storage blob upload-batch -d container1   -s ./test_data/
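Downloads mirror the two upload commands. A sketch assuming AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY are exported as above; the az lines are commented out because they need live credentials:

```shell
# Local destination for the batch download
mkdir -p ./restored_data

# Download a single blob (-n blob name, -f local file):
# az storage blob download -c container1 -n img_000000255.fits -f ./img_000000255.fits

# Download every blob in the container into ./restored_data:
# az storage blob download-batch -d ./restored_data -s container1
```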

Documentation for storage-related Azure CLI commands can be seen at the Azure doc site.

Transferring via rclone

There are multiple ways to deal with Azure authentication. See the rclone Azure auth page for more info.

In the example below, authentication has been set up using account-name/key.

[user@helix ~]$ module load rclone
[+] Loading rclone  1.62.2

[user@helix ~]$ rclone config
Enter configuration password:
password:
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> n

Enter name for new remote.
name> azureblob

Option Storage.
Type of storage to configure.
Choose a number from below, or type in your own value.
[...]
30 / Microsoft Azure Blob Storage
   \ (azureblob)
Storage> 30

Option account.
Azure Storage Account Name.
Set this to the Azure Storage Account Name in use.
Leave blank to use SAS URL or Emulator, otherwise it needs to be set.
If this is blank and if env_auth is set it will be read from the
environment variable `AZURE_STORAGE_ACCOUNT_NAME` if possible.
Enter a value. Press Enter to leave empty.
account>myazurestorageaccount

Option env_auth.
Read credentials from runtime (environment variables, CLI or MSI).
See the [authentication docs](/azureblob#authentication) for full info.
Enter a boolean value (true or false). Press Enter for the default (false).
env_auth> false

Option key.
Storage Account Shared Key.
Leave blank to use SAS URL or Emulator.
Enter a value. Press Enter to leave empty.
key> myazurestoragekey

Option sas_url.
SAS URL for container level access only.
Leave blank if using account/key or Emulator.
Enter a value. Press Enter to leave empty.
sas_url>

Option tenant.
ID of the service principal's tenant. Also called its directory ID.
Set this if using
- Service principal with client secret
- Service principal with certificate
- User with username and password
Enter a value. Press Enter to leave empty.
tenant>

Option client_id.
The ID of the client in use.
Set this if using
- Service principal with client secret
- Service principal with certificate
- User with username and password
Enter a value. Press Enter to leave empty.
client_id>

Option client_secret.
One of the service principal's client secrets
Set this if using
- Service principal with client secret
Enter a value. Press Enter to leave empty.
client_secret>

Option client_certificate_path.
Path to a PEM or PKCS12 certificate file including the private key.
Set this if using
- Service principal with certificate
Enter a value. Press Enter to leave empty.
client_certificate_path>

Option client_certificate_password.
Password for the certificate file (optional).
Optionally set this if using
- Service principal with certificate
And the certificate has a password.
Choose an alternative below. Press Enter for the default (n).
y) Yes, type in my own password
g) Generate random password
n) No, leave this optional password blank (default)
y/g/n> n

Edit advanced config?
y) Yes
n) No (default)
y/n> n

Configuration complete.
Options:
- type: azureblob
- account: myazurestorageaccount
- key: myazurestoragekey
Keep this "azureblob" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y

Current remotes:

Name                 Type
====                 ====
azureblob            azureblob
box                  box
onedrive             onedrive

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
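The interactive session above simply writes a stanza to rclone's config file (~/.config/rclone/rclone.conf by default). If the config file is not password-protected, an equivalent remote can be defined by adding the stanza directly; the credentials below are placeholders:

```
[azureblob]
type = azureblob
account = myazurestorageaccount
key = myazurestoragekey
```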

Now that rclone is configured, it can be used to list blobs in a container on Azure Blob Storage, e.g.:

[user@helix ~]$ rclone ls azureblob:container2
Enter configuration password:
password:
     4678 img_000000248.fits

Transfer a directory:

# To avoid entering the rclone configuration password each time, store it in an environment variable.
[user@helix ~]$ export RCLONE_CONFIG_PASS=myrcloneconfigpass

# Upload a directory to Azure. Note that 'sync' makes the destination match the
# source, deleting destination files that are absent locally; use 'rclone copy' to avoid deletions.
[user@helix ~]$ rclone mkdir azureblob:container3
[user@helix ~]$ rclone sync --progress --update ./50GB-in-medium-files/ azureblob:container3
Transferred:   	   46.566 GiB / 46.566 GiB, 100%, 68.300 MiB/s, ETA 0s
Checks:               304 / 304, 100%
Deleted:              304 (files), 0 (dirs)
Transferred:         1875 / 1875, 100%
Elapsed time:     10m16.9s

# Download a set of files from Azure
[user@helix ~]$ rclone sync --progress --update azureblob:container3 ./test_data
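After a transfer, rclone check can confirm that the two sides match by comparing sizes and (where available) MD5 hashes, without moving any data. A sketch using the placeholder remote from above; the live command is shown commented because it needs the configured remote:

```shell
# Requires the 'azureblob' remote configured earlier
CHECK_CMD="rclone check azureblob:container3 ./test_data"
echo "$CHECK_CMD"
# $CHECK_CMD    # uncomment to run against the live remote
```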

Transferring via Globus

The HPC staff are in the process of setting up a Globus connector for Azure Blob Storage.