|
Alert: The original NIH HPC/Biowulf object storage system has been decommissioned. The HPC staff migrated any data residing in vaults on the Cleversafe object storage system to a new tape-based SpectraLogic BlackPearl system. The documentation on this page relates to the new BlackPearl system.
Object storage is a technology that manages data and any associated metadata as objects with unique identifiers in a non-hierarchical structure. This is in contrast to traditional storage systems, where data is managed as files in a hierarchy of directories. It is popular for cloud and Internet-based applications such as photo sharing, and it has recently become popular for scientific applications that store large amounts of raw data.
There are two main reasons that object storage is well suited to these data-intensive applications:
Despite these advantages, there are some limitations and caveats to using object storage:
The NIH HPC staff have deployed an object storage system. The following table outlines some of the key differences between traditional file systems and object storage as they apply to the HPC systems.
| | Networked file systems (e.g. /data, /scratch) | Object storage |
|---|---|---|
| Locating data | Files exist within a hierarchy of directories and are accessed by path name. | Objects are stored in a flat namespace and are accessed by a unique identifier. |
| Reading and writing data | Uses standard Linux commands (cat, grep, vi, emacs, etc.) to read and write data. Programs can access files by name. | Objects can only be accessed by programs that call specialized functions. |
| Data storage and resiliency | Data is stored on a tightly coupled group of disks within the same data center. The system is engineered such that failures of individual disk drives or other components will not cause data loss. | Data is stored on many loosely coupled servers that may be geographically distributed to ensure very high reliability. The system is engineered such that multiple servers can fail without data loss. |
| Performance | Highly optimized for parallel computing workloads. Uses fast, high-quality components. | Favors a more distributed workload. Components are high quality, but performance is limited by network bandwidth between storage devices. |
Object storage cannot be accessed in the same manner as file-based storage. Most users will use the application programming interface (API) to store and retrieve data from the system. For users who do not wish to write their own programs, we have developed a set of programs that provide basic input and output operations (put, get, delete, list) on the object storage using its API. There are also a number of other utilities, such as Rclone, that can use the S3 protocol to talk to object storage systems.
There are a number of important changes between the older object store implementation (retired in summer 2023) and the present system:
While the HPC staff is transitioning from the old to the new object storage implementation, no new object storage allocations are being granted except in extreme emergencies. If you urgently need object storage space, please e-mail staff@hpc.nih.gov.
The object storage system provides an application programming interface (API) for the management of data. This API is identical to the one first used by Amazon's Simple Storage Service (S3) and now supported by a large number of vendors. The most convenient way to program against it is to use a library for your preferred language that knows how to use the S3 protocol, for example Python's Boto3 module. Developer-specific information about S3 and boto3 is available online.
If you would rather not write your own programs, the HPC staff have developed simple utility scripts that support most day-to-day file management functions. These utilities can be used interactively or within batch scripts to manage data on the object storage system. There are other clients that can access the object storage as well.
Connecting to the object store via its API requires the use of an API key, which is a set of credentials used by the system to identify users. This key will be provided to you by the HPC staff using encrypted or secure e-mail. It is important that you protect these credentials and make sure that no one else gains access to them.
The API key consists of two parts, a key ID and a secret access key. Once you have received these from the HPC staff, you should create a file called /home/$USER/.boto3 (replace $USER with your user name) having the following contents:
```
[Credentials]
aws_access_key_id = your_key_id
aws_secret_access_key = your_secret_key
```

Make sure that this file is only readable by you by issuing the following command:

```
chmod 0400 /home/$USER/.boto3
```
In many cases, the HPC staff will be able to create the .boto3 file for you. In these cases, you do not have to create it yourself.
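For users who plan to write their own programs against the S3 API, the following minimal Python sketch shows one way to use the credentials from this file with boto3 (assuming boto3 is installed). It lists the objects in a bucket; the endpoint URL is the one shown in the rclone and AWS CLI examples later on this page, and the bucket name is a hypothetical placeholder.

```python
#!/usr/bin/env python3
# Minimal sketch: list the objects in a bucket on the HPC object store
# using boto3 and the credentials stored in ~/.boto3.
import configparser
import os

import boto3

# Read the key ID and secret key from the .boto3 file described above.
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.boto3"))
creds = config["Credentials"]

# Endpoint URL taken from the rclone/AWS CLI examples on this page.
s3 = boto3.client(
    "s3",
    endpoint_url="https://bpds3:8443",
    aws_access_key_id=creds["aws_access_key_id"],
    aws_secret_access_key=creds["aws_secret_access_key"],
)

# list_objects_v2 returns at most 1000 keys per call, so use a
# paginator to iterate over all objects in the bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="mybucket"):  # hypothetical bucket name
    for obj in page.get("Contents", []):
        print(obj["Size"], obj["Key"])
```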
Because many users will access the object storage via its API, we have developed scripts that incorporate many file management operations (putting data on the object store, retrieving it, listing objects, etc.). These scripts allow users to manipulate data on the object store without writing their own programs. They also can be used as examples for users who wish to write their own programs that will access the object store. These scripts are written in Python, and described below. In addition, there are various tools that have been written to access object storage. Please e-mail staff@hpc.nih.gov if you are interested in having them installed.
Please note that all scripts support a help option (e.g., obj2 df --help or obj2 ls --help).
IMPORTANT NOTE: If your bucket name is different from your user name, remember to add the -b <BUCKETNAME> flag to your obj2 command.
| Name | Description | Notes |
|---|---|---|
| obj2 ls | Lists objects on the storage. | Listing objects can be slow if there is a large number of them. Please use this sparingly, and find other mechanisms for keeping track of objects that you have stored. |
| obj2 put | Copies data from a file storage system such as /data or /home to the object store. | Supports wildcards. Note that copied data is not deleted from the source file system and will still count against any quotas. |
| obj2 get | Copies data from the object store to a file system file or standard output. | Also supports wildcards. There are several options for choosing where data from the object store is placed on the file system. If obj2 put was used to copy the data to the object store, obj2 get can attempt to restore the original file modification time and permissions. Data can also be written to standard output and piped to applications. |
| obj2 rm | Removes an object or objects from the object store. | For safety, only the "?" wildcard character is supported (it matches any single character). Please note that there are no back-ups or snapshots of the object store, so once an object is deleted, it is gone forever. |
| obj2 df | Shows space utilization on the object store for all accessible buckets. | Please be careful not to exceed the amount of space allocated to you; if you do, you will not be able to store new objects. |
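As a brief illustration, a session combining these scripts might look like the following. The exact arguments are hypothetical; run each script with --help for its actual syntax, and remember to add -b <BUCKETNAME> to each command if your bucket name differs from your user name.

```
helix$ obj2 df                  # check space utilization before storing data
helix$ obj2 put mydata.tar      # copy a file from /data or /home to the object store
helix$ obj2 ls                  # confirm that the object was stored
helix$ obj2 get mydata.tar      # copy the object back to the file system
```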
Because the object store uses reasonably standard S3, it is possible to configure various clients to work with it. Of these, the HPC staff support rclone and the AWS CLI.
Rclone setup instructions
The following example shows how to set up rclone to work with the object store:
```
helix$ module load rclone
[+] Loading rclone 1.62.2
helix$ rclone config
Current remotes:

Name                 Type
====                 ====
example              s3

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> n

Enter name for new remote.
name> hpc-object

Option Storage.
Type of storage to configure.
Choose a number from below, or type in your own value.
 1 / 1Fichier \ (fichier)
 2 / Akamai NetStorage \ (netstorage)
 3 / Alias for an existing remote \ (alias)
 4 / Amazon Drive \ (amazon cloud drive)
 5 / Amazon S3 Compliant Storage Providers including AWS, Alibaba, Ceph, China Mobile, Cloudflare, ArvanCloud, DigitalOcean, Dreamhost, Huawei OBS, IBM COS, IDrive e2, IONOS Cloud, Liara, Lyve Cloud, Minio, Netease, RackCorp, Scaleway, SeaweedFS, StackPath, Storj, Tencent COS, Qiniu and Wasabi \ (s3)
 6 / Backblaze B2 \ (b2)
 7 / Better checksums for other remotes \ (hasher)
 8 / Box \ (box)
 9 / Cache a remote \ (cache)
10 / Citrix Sharefile \ (sharefile)
11 / Combine several remotes into one \ (combine)
12 / Compress a remote \ (compress)
13 / Dropbox \ (dropbox)
14 / Encrypt/Decrypt a remote \ (crypt)
15 / Enterprise File Fabric \ (filefabric)
16 / FTP \ (ftp)
17 / Google Cloud Storage (this is not Google Drive) \ (google cloud storage)
18 / Google Drive \ (drive)
19 / Google Photos \ (google photos)
20 / HTTP \ (http)
21 / Hadoop distributed file system \ (hdfs)
22 / HiDrive \ (hidrive)
23 / In memory object storage system. \ (memory)
24 / Internet Archive \ (internetarchive)
25 / Jottacloud \ (jottacloud)
26 / Koofr, Digi Storage and other Koofr-compatible storage providers \ (koofr)
27 / Local Disk \ (local)
28 / Mail.ru Cloud \ (mailru)
29 / Mega \ (mega)
30 / Microsoft Azure Blob Storage \ (azureblob)
31 / Microsoft OneDrive \ (onedrive)
32 / OpenDrive \ (opendrive)
33 / OpenStack Swift (Rackspace Cloud Files, Memset Memstore, OVH) \ (swift)
34 / Oracle Cloud Infrastructure Object Storage \ (oracleobjectstorage)
35 / Pcloud \ (pcloud)
36 / Put.io \ (putio)
37 / QingCloud Object Storage \ (qingstor)
38 / SMB / CIFS \ (smb)
39 / SSH/SFTP \ (sftp)
40 / Sia Decentralized Cloud \ (sia)
41 / Storj Decentralized Cloud Storage \ (storj)
42 / Sugarsync \ (sugarsync)
43 / Transparently chunk/split large files \ (chunker)
44 / Union merges the contents of several upstream fs \ (union)
45 / Uptobox \ (uptobox)
46 / WebDAV \ (webdav)
47 / Yandex Disk \ (yandex)
48 / Zoho \ (zoho)
49 / premiumize.me \ (premiumizeme)
50 / seafile \ (seafile)
Storage> 5

Option provider.
Choose your S3 provider.
Choose a number from below, or type in your own value.
Press Enter to leave empty.
 1 / Amazon Web Services (AWS) S3 \ (AWS)
 2 / Alibaba Cloud Object Storage System (OSS) formerly Aliyun \ (Alibaba)
 3 / Ceph Object Storage \ (Ceph)
 4 / China Mobile Ecloud Elastic Object Storage (EOS) \ (ChinaMobile)
 5 / Cloudflare R2 Storage \ (Cloudflare)
 6 / Arvan Cloud Object Storage (AOS) \ (ArvanCloud)
 7 / DigitalOcean Spaces \ (DigitalOcean)
 8 / Dreamhost DreamObjects \ (Dreamhost)
 9 / Huawei Object Storage Service \ (HuaweiOBS)
10 / IBM COS S3 \ (IBMCOS)
11 / IDrive e2 \ (IDrive)
12 / IONOS Cloud \ (IONOS)
13 / Seagate Lyve Cloud \ (LyveCloud)
14 / Liara Object Storage \ (Liara)
15 / Minio Object Storage \ (Minio)
16 / Netease Object Storage (NOS) \ (Netease)
17 / RackCorp Object Storage \ (RackCorp)
18 / Scaleway Object Storage \ (Scaleway)
19 / SeaweedFS S3 \ (SeaweedFS)
20 / StackPath Object Storage \ (StackPath)
21 / Storj (S3 Compatible Gateway) \ (Storj)
22 / Tencent Cloud Object Storage (COS) \ (TencentCOS)
23 / Wasabi Object Storage \ (Wasabi)
24 / Qiniu Object Storage (Kodo) \ (Qiniu)
25 / Any other S3 compatible provider \ (Other)
provider> 25

Option env_auth.
Get AWS credentials from runtime (environment variables or EC2/ECS meta data if no env vars).
Only applies if access_key_id and secret_access_key is blank.
Choose a number from below, or type in your own boolean value (true or false).
Press Enter for the default (false).
 1 / Enter AWS credentials in the next step. \ (false)
 2 / Get AWS credentials from the environment (env vars or IAM). \ (true)
env_auth> 1

Option access_key_id.
AWS Access Key ID.
Leave blank for anonymous access or runtime credentials.
Enter a value. Press Enter to leave empty.
access_key_id> enter your access key ID here

Option secret_access_key.
AWS Secret Access Key (password).
Leave blank for anonymous access or runtime credentials.
Enter a value. Press Enter to leave empty.
secret_access_key> enter your secret access key here

Option region.
Region to connect to.
Leave blank if you are using an S3 clone and you don't have a region.
Choose a number from below, or type in your own value.
Press Enter to leave empty.
   / Use this if unsure.
 1 | Will use v4 signatures and an empty region.
   \ ()
   / Use this only if v4 signatures don't work.
 2 | E.g. pre Jewel/v10 CEPH.
   \ (other-v2-signature)
region> 1

Option endpoint.
Endpoint for S3 API.
Required when using an S3 clone.
Enter a value. Press Enter to leave empty.
endpoint> https://bpds3:8443

Option location_constraint.
Location constraint - must be set to match the Region.
Leave blank if not sure. Used when creating buckets only.
Enter a value. Press Enter to leave empty.
location_constraint>

Option acl.
Canned ACL used when creating buckets and storing or copying objects.
This ACL is used for creating objects and if bucket_acl isn't set, for creating buckets too.
For more info visit https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl
Note that this ACL is applied when server-side copying objects as S3
doesn't copy the ACL from the source but rather writes a fresh one.
If the acl is an empty string then no X-Amz-Acl: header is added and
the default (private) will be used.
Choose a number from below, or type in your own value.
Press Enter to leave empty.
   / Owner gets FULL_CONTROL.
 1 | No one else has access rights (default).
   \ (private)
   / Owner gets FULL_CONTROL.
 2 | The AllUsers group gets READ access.
   \ (public-read)
   / Owner gets FULL_CONTROL.
 3 | The AllUsers group gets READ and WRITE access.
   | Granting this on a bucket is generally not recommended.
   \ (public-read-write)
   / Owner gets FULL_CONTROL.
 4 | The AuthenticatedUsers group gets READ access.
   \ (authenticated-read)
   / Object owner gets FULL_CONTROL.
 5 | Bucket owner gets READ access.
   | If you specify this canned ACL when creating a bucket, Amazon S3 ignores it.
   \ (bucket-owner-read)
   / Both the object owner and the bucket owner get FULL_CONTROL over the object.
 6 | If you specify this canned ACL when creating a bucket, Amazon S3 ignores it.
   \ (bucket-owner-full-control)
acl>

Edit advanced config?
y) Yes
n) No (default)
y/n> n

Configuration complete.
Options:
- type: s3
- provider: Other
- access_key_id: not shown
- secret_access_key: not shown
- endpoint: https://bpds3:8443
Keep this "hpc-object" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y

Current remotes:

Name                 Type
====                 ====
hpc-object           s3
example              s3

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
```
Using rclone:
```
helix$ rclone ls hpc-object:bucketname
      210 object1
     9102 object2
     9102 object3
```
Important usage note: Please add the --s3-no-check-bucket flag to rclone when putting data into a bucket. Without it, you may encounter permission errors.
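For example, a copy into a bucket might look like this (the file and bucket names are placeholders):

```
helix$ rclone copy mydata.tar hpc-object:bucketname --s3-no-check-bucket
```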
More documentation is available on our rclone web page or by running rclone --help.
The AWS CLI

Initial set-up:
Note: Naming your AWS profile is optional. Feel free to use the default profile to access the object store. If you wish to do this, leave the --profile option out of the aws configure command.
```
helix$ ml aws
[+] Loading aws current on helix.nih.gov
helix$ aws configure --profile hpc-object
AWS Access Key ID [None]: enter your key ID here
AWS Secret Access Key [None]: enter your secret key here
Default region name [None]:
Default output format [None]:
```
Using the AWS CLI:
Note: Omit the --profile flag from the commands below if you are using the default profile. Also, the --endpoint-url option must be specified on every call.
```
helix$ aws s3 --endpoint-url=https://bpds3:8443 --profile=hpc-object ls s3://bucketname
2023-04-24 15:33:21        210 object1
2023-04-24 15:33:21       9102 object2
2023-04-24 15:33:21       9102 object3
```
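Uploads work the same way with the cp subcommand; for example (the file and bucket names are placeholders):

```
helix$ aws s3 --endpoint-url=https://bpds3:8443 --profile=hpc-object cp mydata.tar s3://bucketname/
```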
Further information about the AWS CLI may be found in its documentation. The s3 and s3api subcommands and their options are relevant for interacting with the object store.
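As an illustration of the lower-level s3api subcommand, the following retrieves the metadata of a single object (the bucket and object names are placeholders):

```
helix$ aws s3api head-object --bucket bucketname --key object1 --endpoint-url=https://bpds3:8443 --profile=hpc-object
```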
Because object storage is different from file-based storage, there are some different rules and policies surrounding its access. However, all of the NIH HPC policies related to storage apply to the object store. Note that the object store may be used for storage of relatively inactive data that still needs to be kept on the HPC systems.
Object storage-specific policies are: