|
Alert: The original NIH HPC/Biowulf object storage system has been decommissioned. The HPC staff migrated any data residing in vaults on the Cleversafe object storage system to a new tape-based SpectraLogic BlackPearl system. The documentation on this page relates to the new BlackPearl system.
Object storage is a technology that manages data and any associated metadata as objects with unique identifiers in a non-hierarchical structure. This is in contrast to traditional storage systems, where data is managed as files in a hierarchy of directories. It is popular for cloud and Internet-based applications such as photo sharing, and it has recently become popular for scientific applications that store large amounts of raw data.
There are two main reasons that object storage is well suited to these data-intensive applications:
Despite these advantages, there are some limitations and caveats to using object storage:
The NIH HPC staff have deployed an object storage system. The following table outlines some of the key differences between traditional file systems and object storage as they apply to the HPC systems.
| | Networked file systems (e.g. /data, /scratch) | Object storage |
|---|---|---|
| Locating data | Files exist within a hierarchy of directories and are accessed by path name. | Objects are stored in a flat namespace and are accessed by a unique identifier. |
| Reading and writing data | Uses standard Linux commands (cat, grep, vi, emacs, etc.) to read and write data. Programs can access files by name. | Objects can only be accessed by programs that call specialized functions. |
| Data storage and resiliency | Data is stored on a tightly coupled group of disks within the same data center. The system is engineered such that failures of individual disk drives or other components will not cause data loss. | Data is stored on many loosely coupled servers that may be geographically distributed to ensure very high reliability. The system is engineered such that multiple servers can fail without data loss. |
| Performance | Highly optimized for parallel computing workloads. Uses fast, high-quality components. | Favors a more distributed workload. Components are high quality, but performance is limited by network bandwidth between storage devices. |
Object storage cannot be accessed in the same manner as file-based storage. Most users will use the application programming interface (API) to store and retrieve data from the system. For users who do not wish to write their own programs, we have developed a set of programs that provide basic input and output operations (put, get, delete, list) on the object storage using its API. There are also a number of other utilities, such as Rclone, that can use the S3 protocol to talk to object storage systems.
There are a number of important changes between the older object store implementation (retired in summer 2023) and the present system:
While the HPC staff is transitioning from the old to the new object storage implementation, no new object storage allocations are being granted except in extreme emergencies. If you urgently need object storage space, please e-mail staff@hpc.nih.gov.
The object storage system provides an application programming interface (API) for the management of data. This API is identical to the one first used by Amazon's Simple Storage Service (S3) and now supported by a large number of vendors. The most convenient way to program against it is to use a library for your preferred language that knows how to use the S3 protocol, for example Python's Boto3 module. Developer-specific information about S3 and boto3 is available online.
If you would rather not write your own programs, the HPC staff have developed simple utility scripts that support most day-to-day file management functions. These utilities can be used interactively or within batch scripts to manage data on the object storage system. There are other clients that can access the object storage as well.
Connecting to the object store via its API requires the use of an API key, which is a set of credentials used by the system to identify users. This key will be provided to you by the HPC staff using encrypted or secure e-mail. It is important that you protect these credentials and make sure that no one else gains access to them.
The API key consists of two parts, a key ID and a secret access key. Once you have received these from the HPC staff, you should create a file called /home/$USER/.boto3 (replace $USER with your user name) having the following contents:
```
[Credentials]
aws_access_key_id = your_key_id
aws_secret_access_key = your_secret_key
```

Make sure that this file is only readable by you by issuing the following command:

```
chmod 0400 /home/$USER/.boto3
```
In many cases, the HPC staff will be able to create the .boto3 file for you. In these cases, you do not have to create it yourself.
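For users who plan to write their own programs against the S3 API, the following minimal Python sketch shows one way to use the credentials from this file with boto3 (assuming boto3 is installed). It lists the objects in a bucket; the endpoint URL is the one shown in the rclone and AWS CLI examples later on this page, and the bucket name is a hypothetical placeholder.

```python
#!/usr/bin/env python3
# Minimal sketch: list the objects in a bucket on the HPC object store
# using boto3 and the credentials stored in ~/.boto3.
import configparser
import os

import boto3

# Read the key ID and secret key from the .boto3 file described above.
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.boto3"))
creds = config["Credentials"]

# Endpoint URL taken from the rclone/AWS CLI examples on this page.
s3 = boto3.client(
    "s3",
    endpoint_url="https://bpds3:8443",
    aws_access_key_id=creds["aws_access_key_id"],
    aws_secret_access_key=creds["aws_secret_access_key"],
)

# list_objects_v2 returns at most 1000 keys per call, so use a
# paginator to iterate over all objects in the bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="mybucket"):  # hypothetical bucket name
    for obj in page.get("Contents", []):
        print(obj["Size"], obj["Key"])
```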
Because many users will access the object storage via its API, we have developed scripts that incorporate many file management operations (putting data on the object store, retrieving it, listing objects, etc.). These scripts allow users to manipulate data on the object store without writing their own programs. They also can be used as examples for users who wish to write their own programs that will access the object store. These scripts are written in Python, and described below. In addition, there are various tools that have been written to access object storage. Please e-mail staff@hpc.nih.gov if you are interested in having them installed.
Please note that all scripts support a help option (e.g., obj2 df --help or obj2 ls --help).
IMPORTANT NOTE: If your bucket name is different from your user name, remember to add the -b <BUCKETNAME> flag to your obj2 command.
| Name | Description | Notes |
|---|---|---|
| obj2 ls | Lists objects on the storage. | Listing objects can be slow if there is a large number of them. Please use this sparingly, and find other mechanisms for keeping track of objects that you have stored. |
| obj2 put | Copies data from a file storage system such as /data or /home to the object store. | Supports wildcards. Note that copied data is not deleted from the source file system and will still count against any quotas. |
| obj2 get | Copies data from the object store to a file system file or standard output. | Also supports wildcards. There are several options for choosing where data from the object store is placed on the file system. If obj2 put was used to copy the data to the object store, obj2 get can attempt to restore the original file modification time and permissions. Data can also be written to standard output and piped to applications. |
| obj2 rm | Removes an object or objects from the object store. | For safety, only the "?" wildcard character is supported (it matches any single character). Please note that there are no back-ups or snapshots of the object store, so once an object is deleted, it is gone forever. |
| obj2 df | Shows space utilization on the object store for all accessible buckets. | Please be careful not to exceed the amount of space allocated to you; if you do, you will not be able to store new objects. |
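As a brief illustration, a session combining these scripts might look like the following. The exact arguments are hypothetical; run each script with --help for its actual syntax, and remember to add -b <BUCKETNAME> to each command if your bucket name differs from your user name.

```
helix$ obj2 df                  # check space utilization before storing data
helix$ obj2 put mydata.tar      # copy a file from /data or /home to the object store
helix$ obj2 ls                  # confirm that the object was stored
helix$ obj2 get mydata.tar      # copy the object back to the file system
```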
Because the object store uses reasonably standard S3, it is possible to configure various clients to work with it. Of these, the HPC staff support rclone and the AWS CLI.
Rclone setup instructions
The following example shows how to set up rclone to work with the object store:
```
helix$ module load rclone
[+] Loading rclone 1.62.2
helix$ rclone config
Current remotes:

Name                 Type
====                 ====
example              s3

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> n

Enter name for new remote.
name> hpc-object

Option Storage.
Type of storage to configure.
Choose a number from below, or type in your own value.
 1 / 1Fichier \ (fichier)
 2 / Akamai NetStorage \ (netstorage)
 3 / Alias for an existing remote \ (alias)
 4 / Amazon Drive \ (amazon cloud drive)
 5 / Amazon S3 Compliant Storage Providers including AWS, Alibaba, Ceph, China Mobile, Cloudflare, ArvanCloud, DigitalOcean, Dreamhost, Huawei OBS, IBM COS, IDrive e2, IONOS Cloud, Liara, Lyve Cloud, Minio, Netease, RackCorp, Scaleway, SeaweedFS, StackPath, Storj, Tencent COS, Qiniu and Wasabi \ (s3)
 6 / Backblaze B2 \ (b2)
 7 / Better checksums for other remotes \ (hasher)
 8 / Box \ (box)
 9 / Cache a remote \ (cache)
10 / Citrix Sharefile \ (sharefile)
11 / Combine several remotes into one \ (combine)
12 / Compress a remote \ (compress)
13 / Dropbox \ (dropbox)
14 / Encrypt/Decrypt a remote \ (crypt)
15 / Enterprise File Fabric \ (filefabric)
16 / FTP \ (ftp)
17 / Google Cloud Storage (this is not Google Drive) \ (google cloud storage)
18 / Google Drive \ (drive)
19 / Google Photos \ (google photos)
20 / HTTP \ (http)
21 / Hadoop distributed file system \ (hdfs)
22 / HiDrive \ (hidrive)
23 / In memory object storage system. \ (memory)
24 / Internet Archive \ (internetarchive)
25 / Jottacloud \ (jottacloud)
26 / Koofr, Digi Storage and other Koofr-compatible storage providers \ (koofr)
27 / Local Disk \ (local)
28 / Mail.ru Cloud \ (mailru)
29 / Mega \ (mega)
30 / Microsoft Azure Blob Storage \ (azureblob)
31 / Microsoft OneDrive \ (onedrive)
32 / OpenDrive \ (opendrive)
33 / OpenStack Swift (Rackspace Cloud Files, Memset Memstore, OVH) \ (swift)
34 / Oracle Cloud Infrastructure Object Storage \ (oracleobjectstorage)
35 / Pcloud \ (pcloud)
36 / Put.io \ (putio)
37 / QingCloud Object Storage \ (qingstor)
38 / SMB / CIFS \ (smb)
39 / SSH/SFTP \ (sftp)
40 / Sia Decentralized Cloud \ (sia)
41 / Storj Decentralized Cloud Storage \ (storj)
42 / Sugarsync \ (sugarsync)
43 / Transparently chunk/split large files \ (chunker)
44 / Union merges the contents of several upstream fs \ (union)
45 / Uptobox \ (uptobox)
46 / WebDAV \ (webdav)
47 / Yandex Disk \ (yandex)
48 / Zoho \ (zoho)
49 / premiumize.me \ (premiumizeme)
50 / seafile \ (seafile)
Storage> 5

Option provider.
Choose your S3 provider.
Choose a number from below, or type in your own value.
Press Enter to leave empty.
 1 / Amazon Web Services (AWS) S3 \ (AWS)
 2 / Alibaba Cloud Object Storage System (OSS) formerly Aliyun \ (Alibaba)
 3 / Ceph Object Storage \ (Ceph)
 4 / China Mobile Ecloud Elastic Object Storage (EOS) \ (ChinaMobile)
 5 / Cloudflare R2 Storage \ (Cloudflare)
 6 / Arvan Cloud Object Storage (AOS) \ (ArvanCloud)
 7 / DigitalOcean Spaces \ (DigitalOcean)
 8 / Dreamhost DreamObjects \ (Dreamhost)
 9 / Huawei Object Storage Service \ (HuaweiOBS)
10 / IBM COS S3 \ (IBMCOS)
11 / IDrive e2 \ (IDrive)
12 / IONOS Cloud \ (IONOS)
13 / Seagate Lyve Cloud \ (LyveCloud)
14 / Liara Object Storage \ (Liara)
15 / Minio Object Storage \ (Minio)
16 / Netease Object Storage (NOS) \ (Netease)
17 / RackCorp Object Storage \ (RackCorp)
18 / Scaleway Object Storage \ (Scaleway)
19 / SeaweedFS S3 \ (SeaweedFS)
20 / StackPath Object Storage \ (StackPath)
21 / Storj (S3 Compatible Gateway) \ (Storj)
22 / Tencent Cloud Object Storage (COS) \ (TencentCOS)
23 / Wasabi Object Storage \ (Wasabi)
24 / Qiniu Object Storage (Kodo) \ (Qiniu)
25 / Any other S3 compatible provider \ (Other)
provider> 25

Option env_auth.
Get AWS credentials from runtime (environment variables or EC2/ECS meta data if no env vars).
Only applies if access_key_id and secret_access_key is blank.
Choose a number from below, or type in your own boolean value (true or false).
Press Enter for the default (false).
 1 / Enter AWS credentials in the next step. \ (false)
 2 / Get AWS credentials from the environment (env vars or IAM). \ (true)
env_auth> 1

Option access_key_id.
AWS Access Key ID.
Leave blank for anonymous access or runtime credentials.
Enter a value. Press Enter to leave empty.
access_key_id> enter your access key ID here

Option secret_access_key.
AWS Secret Access Key (password).
Leave blank for anonymous access or runtime credentials.
Enter a value. Press Enter to leave empty.
secret_access_key> enter your secret access key here

Option region.
Region to connect to.
Leave blank if you are using an S3 clone and you don't have a region.
Choose a number from below, or type in your own value.
Press Enter to leave empty.
   / Use this if unsure.
 1 | Will use v4 signatures and an empty region.
   \ ()
   / Use this only if v4 signatures don't work.
 2 | E.g. pre Jewel/v10 CEPH.
   \ (other-v2-signature)
region> 1

Option endpoint.
Endpoint for S3 API.
Required when using an S3 clone.
Enter a value. Press Enter to leave empty.
endpoint> https://bpds3:8443

Option location_constraint.
Location constraint - must be set to match the Region.
Leave blank if not sure. Used when creating buckets only.
Enter a value. Press Enter to leave empty.
location_constraint>

Option acl.
Canned ACL used when creating buckets and storing or copying objects.
This ACL is used for creating objects and if bucket_acl isn't set, for creating buckets too.
For more info visit https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl
Note that this ACL is applied when server-side copying objects as S3
doesn't copy the ACL from the source but rather writes a fresh one.
If the acl is an empty string then no X-Amz-Acl: header is added and
the default (private) will be used.
Choose a number from below, or type in your own value.
Press Enter to leave empty.
   / Owner gets FULL_CONTROL.
 1 | No one else has access rights (default).
   \ (private)
   / Owner gets FULL_CONTROL.
 2 | The AllUsers group gets READ access.
   \ (public-read)
   / Owner gets FULL_CONTROL.
 3 | The AllUsers group gets READ and WRITE access.
   | Granting this on a bucket is generally not recommended.
   \ (public-read-write)
   / Owner gets FULL_CONTROL.
 4 | The AuthenticatedUsers group gets READ access.
   \ (authenticated-read)
   / Object owner gets FULL_CONTROL.
 5 | Bucket owner gets READ access.
   | If you specify this canned ACL when creating a bucket, Amazon S3 ignores it.
   \ (bucket-owner-read)
   / Both the object owner and the bucket owner get FULL_CONTROL over the object.
 6 | If you specify this canned ACL when creating a bucket, Amazon S3 ignores it.
   \ (bucket-owner-full-control)
acl>

Edit advanced config?
y) Yes
n) No (default)
y/n> n

Configuration complete.
Options:
- type: s3
- provider: Other
- access_key_id: not shown
- secret_access_key: not shown
- endpoint: https://bpds3:8443
Keep this "hpc-object" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y

Current remotes:

Name                 Type
====                 ====
hpc-object           s3
example              s3

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
```
Using rclone:
```
helix$ rclone ls hpc-object:bucketname
      210 object1
     9102 object2
     9102 object3
```
Important usage note: Please add the --s3-no-check-bucket flag to rclone when putting data into a bucket. Without it, you may encounter permission errors.
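For example, a copy into a bucket might look like this (the file and bucket names are placeholders):

```
helix$ rclone copy mydata.tar hpc-object:bucketname --s3-no-check-bucket
```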
More documentation is available on our rclone web page or by running rclone --help.
The AWS CLI

Initial set-up:
Note: Naming your AWS profile is optional. Feel free to use the default profile to access the object store. If you wish to do this, leave the --profile option out of the aws configure command.
```
helix$ ml aws
[+] Loading aws current on helix.nih.gov
helix$ aws configure --profile hpc-object
AWS Access Key ID [None]: enter your key ID here
AWS Secret Access Key [None]: enter your secret key here
Default region name [None]:
Default output format [None]:
```
Using the AWS CLI:
Note: Omit the --profile flag from the commands below if you are using the default profile. Also, the --endpoint-url option must be specified on every call.
```
helix$ aws s3 --endpoint-url=https://bpds3:8443 --profile=hpc-object ls s3://bucketname
2023-04-24 15:33:21        210 object1
2023-04-24 15:33:21       9102 object2
2023-04-24 15:33:21       9102 object3
```
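Uploads work the same way with the cp subcommand; for example (the file and bucket names are placeholders):

```
helix$ aws s3 --endpoint-url=https://bpds3:8443 --profile=hpc-object cp mydata.tar s3://bucketname/
```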
Further information about the AWS CLI may be found in its documentation. The s3 and s3api subcommands and their options are relevant for interacting with the object store.
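As an illustration of the lower-level s3api subcommand, the following retrieves the metadata of a single object (the bucket and object names are placeholders):

```
helix$ aws s3api head-object --bucket bucketname --key object1 --endpoint-url=https://bpds3:8443 --profile=hpc-object
```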
Because object storage is different from file-based storage, there are some different rules and policies surrounding its access. However, all of the NIH HPC policies related to storage apply to the object store. Note that the object store may be used for storage of relatively inactive data that still needs to be kept on the HPC systems.
Object storage-specific policies are: