Object storage

Alert: The original NIH HPC/Biowulf object storage system has been decommissioned. The HPC staff migrated any data residing in vaults on the Cleversafe object storage system to a new tape-based SpectraLogic BlackPearl system. The documentation on this page relates to the new BlackPearl system.

Overview of object storage

Object storage is a technology that manages data and any associated metadata as objects with unique identifiers in a non-hierarchical structure. This is in contrast to traditional storage systems, where data is managed as files within a hierarchy of directories. Object storage is popular for cloud and Internet-based applications such as photo sharing, and it has recently become popular for scientific applications that store large amounts of raw data.

There are two main reasons that object storage is suitable for these data-intensive applications:

Despite these advantages, there are some limitations and caveats to using object storage:

The NIH HPC Systems object store

The NIH HPC staff have deployed an object storage system. The following comparison outlines some of the key differences between traditional file systems and object storage that are specifically applicable to the HPC systems.

Locating data
  Networked file systems (e.g. /data, /scratch): Files exist within a hierarchy of directories and are accessed by path name.
  Object storage: Objects are stored in a flat namespace and are accessed by a unique identifier.

Reading and writing data
  Networked file systems: Standard Linux commands (cat, grep, vi, emacs, etc.) read and write data, and programs can access files by name.
  Object storage: Objects can only be accessed by programs that call specialized functions.

Data storage and resiliency
  Networked file systems: Data is stored on a tightly coupled group of disks within the same data center. The system is engineered so that failures of individual disk drives or other components will not cause data loss.
  Object storage: Data is stored on many loosely coupled servers that may be geographically distributed to ensure very high reliability. The system is engineered so that multiple servers can fail without data loss.

Performance
  Networked file systems: Highly optimized for a parallel computing workload; uses fast, high-quality components.
  Object storage: Favors a more distributed workload. Components are high-quality, but performance is limited by the network bandwidth between storage devices.

Object storage cannot be accessed in the same manner as file-based storage. Most users will use the application programming interface (API) to store and retrieve data from the system. For users who do not wish to write their own programs, we have developed a set of programs that use the API to provide basic input and output operations (put, get, delete, list) on the object storage. There are also a number of other utilities, such as rclone, that can use the S3 protocol to talk to object storage systems.

Changes from the previous object store

There are a number of important changes between the older object store implementation (retired in summer 2023) and the present system:

Requesting access to the object storage

While the HPC staff is transitioning from the old to the new object storage implementation, no new object storage allocations are being granted except in extreme emergencies. If you urgently need object storage space, please e-mail staff@hpc.nih.gov.

How to use the object storage

The object storage system provides an application programming interface (API) for the management of data. This API is identical to the one first used by Amazon's Simple Storage Service (S3) and is now supported by a large number of vendors. The most convenient way to program against it is to use a library for your preferred language that supports the S3 protocol, for example Python's boto3 module. Developer-specific information about S3 and boto3 is available online.

If you would rather not write your own programs, the HPC staff have developed simple utility scripts that support most day-to-day file management functions. These utilities can be used interactively or within batch scripts to manage data on the object storage system. There are other clients that can access the object storage as well.

Connecting to the object store via its API requires the use of an API key, which is a set of credentials used by the system to identify users. This key will be provided to you by the HPC staff using encrypted or secure e-mail. It is important that you protect these credentials and make sure that no one else gains access to them.

The API key consists of two parts, a key ID and a secret access key. Once you have received these from the HPC staff, you should create a file called /home/$USER/.boto3 (replace $USER with your user name) with the following contents:

[Credentials]
aws_access_key_id = your_key_id
aws_secret_access_key = your_secret_key

Make sure that this file is only readable by you by issuing the following command:
chmod 0400 /home/$USER/.boto3

In many cases, the HPC staff will be able to create the .boto3 file for you. In these cases, you do not have to create it yourself.
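
As an illustration, the following minimal sketch (not one of the HPC-supplied utilities) shows how boto3 could use these credentials to upload and list objects. It assumes the https://bpds3:8443 endpoint shown in the rclone and AWS CLI examples below; the bucket and file names are hypothetical.

# Minimal sketch: upload and list objects with boto3.
# Assumptions: the ~/.boto3 credentials file described above and the
# https://bpds3:8443 endpoint used in the rclone/AWS CLI examples below;
# the bucket and file names are hypothetical.
import configparser
import os

import boto3

# boto3 does not read ~/.boto3 on its own, so parse it explicitly
cfg = configparser.ConfigParser()
cfg.read(os.path.expanduser("~/.boto3"))
creds = cfg["Credentials"]

s3 = boto3.client(
    "s3",
    endpoint_url="https://bpds3:8443",
    aws_access_key_id=creds["aws_access_key_id"],
    aws_secret_access_key=creds["aws_secret_access_key"],
)

bucket = "mybucket"   # hypothetical bucket name (often the same as your user name)

# Copy a local file to the object store, then list the bucket's contents
s3.upload_file("results.tar.gz", bucket, "results.tar.gz")
for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
    print(obj["Key"], obj["Size"])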

Helper scripts for API access

Because many users will access the object storage via its API, we have developed scripts that incorporate many file management operations (putting data on the object store, retrieving it, listing objects, etc.). These scripts allow users to manipulate data on the object store without writing their own programs, and they can also serve as examples for users who wish to write their own programs that access the object store. These scripts are written in Python and are described below. In addition, there are various other tools that have been written to access object storage; please e-mail staff@hpc.nih.gov if you are interested in having them installed.

Please note that all scripts support a help option (for example, obj2 df --help or obj2 ls --help).

IMPORTANT NOTE: If your bucket name is different from your user name, remember to add the -b <BUCKETNAME> flag to your obj2 command.

obj2 ls
  Lists objects on the storage.
  Notes: Listing objects can be slow if there is a large number of them. Please use this sparingly, and find other mechanisms for keeping track of objects that you have stored.

obj2 put
  Copies data from a file storage system such as /data or /home to the object store. Supports wildcards.
  Notes: Copied data is not deleted from the source file system and will still count against any quotas.

obj2 get
  Copies data from the object store to a file system file or to standard output. Also supports wildcards.
  Notes: There are several options for choosing where data from the object store is placed on the file system. If obj2 put was used to copy the data to the object store, obj2 get can attempt to restore the original file modification time and permissions. Data can also be written to standard output and piped to applications.

obj2 rm
  Removes an object or objects from the object store.
  Notes: For safety, only the "?" wildcard character (which matches any single character) is supported. Please note that there are no backups or snapshots of the object store, so once an object is deleted, it is gone forever.

obj2 df
  Shows space utilization on the object store for all accessible buckets.
  Notes: Please be careful not to exceed the amount of space allocated to you; if you do, you will not be able to store new objects.
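
For example, a typical interactive session might look like the following. The file name is hypothetical, the exact options for each subcommand can be viewed with --help, and you should add -b <BUCKETNAME> if your bucket name differs from your user name.

helix$ obj2 put results.tar.gz    # copy a file to the object store
helix$ obj2 ls                    # list your objects (use sparingly)
helix$ obj2 get results.tar.gz    # copy the object back to the file system
helix$ obj2 df                    # check space utilization
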
Using rclone and the AWS CLI

Because the object store uses a reasonably standard S3 interface, it is possible to configure various clients to work with it. Of these, the HPC staff support rclone and the AWS CLI.

Rclone setup instructions

The following example shows how to set up rclone to work with the object store:

helix$ module load rclone
[+] Loading rclone  1.62.2 
helix$ rclone config
Current remotes:

Name                 Type
====                 ====
example              s3

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> n

Enter name for new remote.
name> hpc-object

Option Storage.
Type of storage to configure.
Choose a number from below, or type in your own value.
 1 / 1Fichier
   \ (fichier)
 2 / Akamai NetStorage
   \ (netstorage)
 3 / Alias for an existing remote
   \ (alias)
 4 / Amazon Drive
   \ (amazon cloud drive)
 5 / Amazon S3 Compliant Storage Providers including AWS, Alibaba, Ceph, China Mobile, Cloudflare, ArvanCloud, DigitalOcean, Dreamhost, Huawei OBS, IBM COS, IDrive e2, IONOS Cloud, Liara, Lyve Cloud, Minio, Netease, RackCorp, Scaleway, SeaweedFS, StackPath, Storj, Tencent COS, Qiniu and Wasabi
   \ (s3)
 6 / Backblaze B2
   \ (b2)
 7 / Better checksums for other remotes
   \ (hasher)
 8 / Box
   \ (box)
 9 / Cache a remote
   \ (cache)
10 / Citrix Sharefile
   \ (sharefile)
11 / Combine several remotes into one
   \ (combine)
12 / Compress a remote
   \ (compress)
13 / Dropbox
   \ (dropbox)
14 / Encrypt/Decrypt a remote
   \ (crypt)
15 / Enterprise File Fabric
   \ (filefabric)
16 / FTP
   \ (ftp)
17 / Google Cloud Storage (this is not Google Drive)
   \ (google cloud storage)
18 / Google Drive
   \ (drive)
19 / Google Photos
   \ (google photos)
20 / HTTP
   \ (http)
21 / Hadoop distributed file system
   \ (hdfs)
22 / HiDrive
   \ (hidrive)
23 / In memory object storage system.
   \ (memory)
24 / Internet Archive
   \ (internetarchive)
25 / Jottacloud
   \ (jottacloud)
26 / Koofr, Digi Storage and other Koofr-compatible storage providers
   \ (koofr)
27 / Local Disk
   \ (local)
28 / Mail.ru Cloud
   \ (mailru)
29 / Mega
   \ (mega)
30 / Microsoft Azure Blob Storage
   \ (azureblob)
31 / Microsoft OneDrive
   \ (onedrive)
32 / OpenDrive
   \ (opendrive)
33 / OpenStack Swift (Rackspace Cloud Files, Memset Memstore, OVH)
   \ (swift)
34 / Oracle Cloud Infrastructure Object Storage
   \ (oracleobjectstorage)
35 / Pcloud
   \ (pcloud)
36 / Put.io
   \ (putio)
37 / QingCloud Object Storage
   \ (qingstor)
38 / SMB / CIFS
   \ (smb)
39 / SSH/SFTP
   \ (sftp)
40 / Sia Decentralized Cloud
   \ (sia)
41 / Storj Decentralized Cloud Storage
   \ (storj)
42 / Sugarsync
   \ (sugarsync)
43 / Transparently chunk/split large files
   \ (chunker)
44 / Union merges the contents of several upstream fs
   \ (union)
45 / Uptobox
   \ (uptobox)
46 / WebDAV
   \ (webdav)
47 / Yandex Disk
   \ (yandex)
48 / Zoho
   \ (zoho)
49 / premiumize.me
   \ (premiumizeme)
50 / seafile
   \ (seafile)
Storage> 5

Option provider.
Choose your S3 provider.
Choose a number from below, or type in your own value.
Press Enter to leave empty.
 1 / Amazon Web Services (AWS) S3
   \ (AWS)
 2 / Alibaba Cloud Object Storage System (OSS) formerly Aliyun
   \ (Alibaba)
 3 / Ceph Object Storage
   \ (Ceph)
 4 / China Mobile Ecloud Elastic Object Storage (EOS)
   \ (ChinaMobile)
 5 / Cloudflare R2 Storage
   \ (Cloudflare)
 6 / Arvan Cloud Object Storage (AOS)
   \ (ArvanCloud)
 7 / DigitalOcean Spaces
   \ (DigitalOcean)
 8 / Dreamhost DreamObjects
   \ (Dreamhost)
 9 / Huawei Object Storage Service
   \ (HuaweiOBS)
10 / IBM COS S3
   \ (IBMCOS)
11 / IDrive e2
   \ (IDrive)
12 / IONOS Cloud
   \ (IONOS)
13 / Seagate Lyve Cloud
   \ (LyveCloud)
14 / Liara Object Storage
   \ (Liara)
15 / Minio Object Storage
   \ (Minio)
16 / Netease Object Storage (NOS)
   \ (Netease)
17 / RackCorp Object Storage
   \ (RackCorp)
18 / Scaleway Object Storage
   \ (Scaleway)
19 / SeaweedFS S3
   \ (SeaweedFS)
20 / StackPath Object Storage
   \ (StackPath)
21 / Storj (S3 Compatible Gateway)
   \ (Storj)
22 / Tencent Cloud Object Storage (COS)
   \ (TencentCOS)
23 / Wasabi Object Storage
   \ (Wasabi)
24 / Qiniu Object Storage (Kodo)
   \ (Qiniu)
25 / Any other S3 compatible provider
   \ (Other)
provider> 25

Option env_auth.
Get AWS credentials from runtime (environment variables or EC2/ECS meta data if no env vars).
Only applies if access_key_id and secret_access_key is blank.
Choose a number from below, or type in your own boolean value (true or false).
Press Enter for the default (false).
 1 / Enter AWS credentials in the next step.
   \ (false)
 2 / Get AWS credentials from the environment (env vars or IAM).
   \ (true)
env_auth> 1

Option access_key_id.
AWS Access Key ID.
Leave blank for anonymous access or runtime credentials.
Enter a value. Press Enter to leave empty.
access_key_id> enter your access key ID here

Option secret_access_key.
AWS Secret Access Key (password).
Leave blank for anonymous access or runtime credentials.
Enter a value. Press Enter to leave empty.
secret_access_key> enter your secret access key here

Option region.
Region to connect to.
Leave blank if you are using an S3 clone and you don't have a region.
Choose a number from below, or type in your own value.
Press Enter to leave empty.
   / Use this if unsure.
 1 | Will use v4 signatures and an empty region.
   \ ()
   / Use this only if v4 signatures don't work.
 2 | E.g. pre Jewel/v10 CEPH.
   \ (other-v2-signature)
region> 1

Option endpoint.
Endpoint for S3 API.
Required when using an S3 clone.
Enter a value. Press Enter to leave empty.
endpoint> https://bpds3:8443

Option location_constraint.
Location constraint - must be set to match the Region.
Leave blank if not sure. Used when creating buckets only.
Enter a value. Press Enter to leave empty.
location_constraint> 

Option acl.
Canned ACL used when creating buckets and storing or copying objects.
This ACL is used for creating objects and if bucket_acl isn't set, for creating buckets too.
For more info visit https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl
Note that this ACL is applied when server-side copying objects as S3
doesn't copy the ACL from the source but rather writes a fresh one.
If the acl is an empty string then no X-Amz-Acl: header is added and
the default (private) will be used.
Choose a number from below, or type in your own value.
Press Enter to leave empty.
   / Owner gets FULL_CONTROL.
 1 | No one else has access rights (default).
   \ (private)
   / Owner gets FULL_CONTROL.
 2 | The AllUsers group gets READ access.
   \ (public-read)
   / Owner gets FULL_CONTROL.
 3 | The AllUsers group gets READ and WRITE access.
   | Granting this on a bucket is generally not recommended.
   \ (public-read-write)
   / Owner gets FULL_CONTROL.
 4 | The AuthenticatedUsers group gets READ access.
   \ (authenticated-read)
   / Object owner gets FULL_CONTROL.
 5 | Bucket owner gets READ access.
   | If you specify this canned ACL when creating a bucket, Amazon S3 ignores it.
   \ (bucket-owner-read)
   / Both the object owner and the bucket owner get FULL_CONTROL over the object.
 6 | If you specify this canned ACL when creating a bucket, Amazon S3 ignores it.
   \ (bucket-owner-full-control)
acl>  

Edit advanced config?
y) Yes
n) No (default)
y/n> n

Configuration complete.
Options:
- type: s3
- provider: Other
- access_key_id: not shown
- secret_access_key: not shown
- endpoint: https://bpds3:8443
Keep this "hpc-object" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y

Current remotes:

Name                 Type
====                 ====
hpc-object           s3
example              s3

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q

Using rclone:

helix$ rclone ls hpc-object:bucketname
      210 object1
     9102 object2
     9102 object3

Important usage note: Please add the --s3-no-check-bucket flag to rclone when putting data into a bucket. Without it, you may encounter permission errors.
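
For example, a hypothetical upload of a single file to the remote configured above (the file and bucket names are placeholders) might look like this:

helix$ rclone copy --s3-no-check-bucket results.tar.gz hpc-object:bucketname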

More documentation is available on our rclone web page or by running rclone --help.

The AWS CLI

Initial set-up:

Note: Naming your AWS profile is optional. Feel free to use the default profile to access the object store. If you wish to do this, leave the --profile option out of the aws configure command.


helix$ ml aws
[+] Loading aws  current  on helix.nih.gov 
helix$ aws configure --profile hpc-object
AWS Access Key ID [None]: enter your key ID here
AWS Secret Access Key [None]: enter your secret key here
Default region name [None]: 
Default output format [None]: 

Using the AWS CLI:

Note: Omit the --profile flag from the commands below if you are using the default profile. Also, the --endpoint-url option must be specified on every call.


helix$ aws s3 --endpoint-url=https://bpds3:8443 --profile=hpc-object ls s3://bucketname
2023-04-24 15:33:21        210 object1
2023-04-24 15:33:21       9102 object2
2023-04-24 15:33:21       9102 object3

Further information about the AWS CLI may be found in its documentation. The s3 and s3api subcommands and their options are relevant for interacting with the object store.
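
For example, a hypothetical upload mirroring the listing above (the file and bucket names are placeholders) might look like this:

helix$ aws s3 --endpoint-url=https://bpds3:8443 --profile=hpc-object cp results.tar.gz s3://bucketname/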

Object storage policies

Because object storage is different from file-based storage, there are some different rules and policies surrounding its access. However, all of the NIH HPC policies related to storage apply to the object store. Note that the object store may be used to store relatively inactive data that still needs to be kept on the HPC systems.

Object storage-specific policies are:
