Biowulf High Performance Computing at the NIH
Object storage

Overview of object storage

Object storage is a relatively new technology that manages data and any associated metadata as objects with unique identifiers in a non-hierarchical structure. This is in contrast to traditional storage systems, where data is managed as files within a hierarchy of directories. It is popular for cloud and Internet-based applications such as photo sharing, and it has recently become popular for scientific applications that store large amounts of raw data.

There are two main reasons that object storage is suitable for these data intensive applications:

Despite its several advantages, there are some limitations and caveats for using object storage:

The NIH HPC Systems object store

The NIH HPC staff have recently deployed an object storage system. The following table outlines some of the key differences between traditional file systems and object storage that are specifically applicable to the HPC systems.

Locating data
    Networked file systems (e.g. /data, /scratch): Files exist within a hierarchy of directories and are accessed by path name.
    Object storage: Objects are stored in a flat namespace and are accessed by a unique identifier.

Reading and writing data
    Networked file systems (e.g. /data, /scratch): Uses standard Linux commands (cat, grep, vi, emacs, etc.) to read and write data. Programs can access files by name.
    Object storage: Objects can only be accessed by programs that call specialized functions.

Data storage and resiliency
    Networked file systems (e.g. /data, /scratch): Data is stored on a more tightly coupled group of disks within the same data center. The system is engineered such that failures of individual disk drives or other components will not cause data loss.
    Object storage: Data is stored on many loosely coupled servers that may be geographically distributed to ensure very high reliability. The system is engineered such that multiple servers can fail and data will not be lost.

Performance
    Networked file systems (e.g. /data, /scratch): Highly optimized for a parallel computing workload. Uses fast, high-quality components.
    Object storage: Favors a more distributed workload. Components are high-quality, but performance is limited by network bandwidth between storage devices.
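
To make the "flat namespace" idea concrete, the following is a minimal toy model in Python. It is only an illustration and says nothing about how the NIH HPC object store is implemented; the class name, key names, and metadata field are all hypothetical. It shows that an object is stored and fetched by one unique key, and that anything resembling a directory is just a shared key prefix.

```python
# Toy model of a flat object namespace (illustrative only; all names are hypothetical).
from typing import Dict, List


class FlatObjectStore:
    """Maps unique identifiers (keys) directly to object data plus metadata."""

    def __init__(self) -> None:
        self._objects: Dict[str, dict] = {}

    def put(self, key: str, data: bytes, **metadata: str) -> None:
        # The whole object lives under one key; there are no directories to create.
        self._objects[key] = {"data": data, "metadata": dict(metadata)}

    def get(self, key: str) -> bytes:
        # Retrieval needs the exact identifier, not a path walk.
        return self._objects[key]["data"]

    def list(self, prefix: str = "") -> List[str]:
        # "Directories" are a naming convention: listing scans every key for a prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("run42/sample_A.fastq", b"ACGT", instrument="hypothetical-sequencer")
store.put("run42/sample_B.fastq", b"TTGA")
store.put("run43/sample_A.fastq", b"GGCC")

print(store.list("run42/"))
print(store.get("run43/sample_A.fastq"))
```

Because listing has to examine keys rather than walk a directory tree, it tends to get slower as the number of objects grows, which is one reason to record what you have stored somewhere else.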

Object storage cannot be accessed in the same manner as file-based storage. Most users will use the application programming interface (API) to store and retrieve data from the system. For users who do not wish to write their own programs, we have developed a set of programs that use the API to provide basic input and output operations (put, get, delete, list) on the object storage. There are also a number of other utilities, such as Rclone, that can use the S3 protocol to talk to object storage systems.

Requesting access to the object storage

To request an allocation or quota increase on the object storage system, please fill out the object storage request form.

How to use the object storage

The object storage system provides an application programming interface (API) for the management of data. This API is identical to the one first used by Amazon's Simple Storage Service (S3) and now supported by a large number of vendors. The most convenient way to program against it is to use a library for your preferred language that knows how to use the S3 protocol, for example Python's Boto module. Please send e-mail to staff@hpc.nih.gov for more information about writing programs that use the object storage's API.
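
As a rough sketch of what programming against an S3-compatible API can look like, the helpers below wrap the four basic operations (put, get, list, delete). Several things here are assumptions rather than facts from this page: the use of boto3 (the current AWS SDK for Python, rather than the older Boto module), the endpoint URL, and the bucket and key names. Contact the HPC staff for the actual connection details.

```python
# Sketch only: boto3, the endpoint URL, and all bucket/key names are assumptions.


def make_client(endpoint_url, key_id, secret_key):
    """Create an S3 client for an S3-compatible endpoint (requires boto3)."""
    import boto3  # imported lazily so the helpers below can be used without it

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=key_id,
        aws_secret_access_key=secret_key,
    )


def put_file(client, bucket, key, path):
    """Copy a local file to the object store under the given key."""
    client.upload_file(Filename=path, Bucket=bucket, Key=key)


def get_file(client, bucket, key, path):
    """Copy an object from the object store to a local file."""
    client.download_file(Bucket=bucket, Key=key, Filename=path)


def list_keys(client, bucket, prefix=""):
    """Yield object keys under a prefix, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects carry no "Contents" entry.
        for entry in page.get("Contents", []):
            yield entry["Key"]


def delete_key(client, bucket, key):
    """Delete a single object; this cannot be undone."""
    client.delete_object(Bucket=bucket, Key=key)
```

A session might then look like `client = make_client("https://objectstore.example.org", key_id, secret_key)` followed by `put_file(client, "my-bucket", "run42/sample_A.fastq", "/data/me/sample_A.fastq")`, where the URL, bucket, and paths are placeholders. Passing the client in as an argument keeps the helpers easy to test with a stand-in object.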

If you would rather not write your own programs, the HPC staff have developed simple utility scripts that support most day-to-day file management functions. These utilities can be used interactively or within batch scripts to manage data on the object storage system.

Connecting to the object store via its API requires the use of an API key, which is a set of credentials used by the system to identify users. This key will be provided to you by the HPC staff using encrypted or secure e-mail. It is important that you protect these credentials and make sure that no one else gains access to them.

The API key consists of two parts, a key ID and a secret access key. Once you have received these from the HPC staff, you should create a file called /home/$USER/.boto (replace $USER with your user name) having the following contents:

[Credentials]
aws_access_key_id = your_key_id
aws_secret_access_key = your_secret_key

Make sure that this file is only readable by you by issuing the following command:
chmod 0400 /home/$USER/.boto
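
If you write your own programs, it can be useful to check the credentials file before using it. The following is a minimal sketch, assuming the file layout shown above and a POSIX system; the function name is hypothetical and this is not part of the HPC staff's helper scripts. It refuses to load a file that group members or other users can access, then returns the two parts of the API key.

```python
import configparser
import os
import stat


def load_boto_credentials(path):
    """Return (key_id, secret_key) from a Boto-style credentials file.

    Raises PermissionError if the file is accessible by group or others.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(
            f"{path} is accessible by group or others (mode {mode:04o}); "
            "run chmod 0400 on it first"
        )

    config = configparser.ConfigParser()
    config.read(path)
    section = config["Credentials"]
    return section["aws_access_key_id"], section["aws_secret_access_key"]
```

For example, `load_boto_credentials(os.path.expanduser("~/.boto"))` returns the key ID and secret access key as a tuple, or raises an error if the permissions are too open.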

Helper scripts for API access

Because many users will access the object storage via its API, we have developed scripts that incorporate many file management operations (putting data on the object store, retrieving it, listing objects, etc.). These scripts allow users to manipulate data on the object store without writing their own programs. They also can be used as examples for users who wish to write their own programs that will access the object store. These scripts are written in Python, and described below. In addition, there are various tools that have been written to access object storage. Please e-mail staff@hpc.nih.gov if you are interested in having them installed.

Please note that all scripts support a help option (usually -h, but -? for obj_df).

obj_ls
    Description: Lists objects on the storage.
    Notes: Listing objects can be slow if there is a large number of them. Please use this sparingly, and find other mechanisms for keeping track of objects that you have stored.

obj_put
    Description: Copies data from a file storage system such as /data or /home to the object store.
    Notes: Supports wildcards. Note that copied data is not deleted from the source file system and will still count against any quotas.

obj_get
    Description: Copies data from the object store to a file system file or standard output.
    Notes: Also supports wildcards. There are several options for choosing where data from the object store is placed on the filesystem. If obj_put was used to copy the data to the object store, obj_get can attempt to restore the original file modification time and permissions. Data can also be written to standard output and piped to applications.

obj_rm
    Description: Removes data from the object store.
    Notes: Removes an object or objects from the object store. For safety, only the "?" wildcard character is supported (matches any single character). Please note that there are no back-ups or snapshots of the object store, so once an object is deleted, it is gone forever.

obj_df
    Description: Shows space utilization on the object store.
    Notes: Please be careful not to exceed the amount of space allocated to you; if you do, you will not be able to store new objects.
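
The single-character "?" wildcard described for obj_rm can be illustrated with a short Python sketch. This is not the obj_rm implementation; the function name and object names are hypothetical. It only shows the matching rule stated above: "?" matches exactly one character, and every other character, including "*", is taken literally.

```python
import re


def question_mark_match(pattern, key):
    """Return True if key matches pattern, where only '?' is a wildcard."""
    # Translate '?' to "any one character" and escape everything else literally.
    regex = "".join("." if ch == "?" else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, key, flags=re.DOTALL) is not None


print(question_mark_match("sample_?.bam", "sample_1.bam"))
print(question_mark_match("sample_?.bam", "sample_10.bam"))
print(question_mark_match("sample_*.bam", "sample_1.bam"))
```

Restricting deletion to "?" means a pattern can never match names of a different length than the pattern itself, which limits how much a mistyped pattern can remove.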

Object storage policies

Because object storage is different from file-based storage, there are some different rules and policies surrounding its access. However, all of the NIH HPC policies related to storage also apply to the object store. Note that the object store may be used for storage of relatively inactive data that still needs to be kept on the HPC systems.

Object storage-specific policies are: