IBM Cloud Object Storage
Concepts and Architecture
System Edition

Alexander Gavrin
Bradley Leonard
Hao Jia
Johan Verstrepen
Jussi Lehtinen
Lars Lauber
Patrik Jelinko
Raj Shah
Steven Pratt
Vasfi Gucer

Redpaper

Introduction

Object Storage is the primary storage solution used in cloud and on-premises solutions as a central storage platform for unstructured data. Object Storage is growing more popular for the following reasons:

- It is designed for exabyte scale.
- It is easy to manage and yet meets the growing demands of enterprises for a broad set of applications and workloads.
- It allows users to balance storage cost, location, and compliance control requirements across data sets and essential applications.

IBM Cloud Object Storage (IBM COS) system provides industry-leading flexibility that enables your organization to handle the unpredictable, ever-changing needs of business and evolving workloads.

IBM COS system is a software-defined storage solution that is hardware aware. This awareness allows IBM COS to be an enterprise-grade storage solution that is highly available and reliable and uses commodity x86 servers. IBM COS takes full advantage of this hardware awareness by ensuring that the server performs optimally from a monitoring, management, and performance perspective.

The target audience for this IBM Redpaper publication is IBM COS architects, IT specialists, and technologists.

Summary of changes in this new revision (REDP5537-02):

This paper is the third edition of the paper IBM Cloud Object Storage Concepts and Architecture, REDP5537-00, that was originally published on May 29, 2019.

The following new information is included in this revision:

- Container mode
- Storage account portal
- S3 versioning
- Zone slice storage (ZSS)
- Object expiration
- Security enhancements

IBM COS includes a rich set of features to match various use cases.

Copyright IBM Corp. 2021. All rights reserved.

Figure 1 shows the main features of IBM COS and typical use cases.

Figure 1 IBM COS main features and typical use cases

IBM validated more than 100 IBM and third-party applications with IBM COS and created extensive technical documents that describe their interoperability.

Validated applications per use case include the following examples:

- Backup:
  – IBM Spectrum Protect, IBM Spectrum Protect Plus
  – Commvault
  – Veritas NetBackup
  – Rubrik
  – Veeam
  – Cohesity
- Active archive:
  – Komprise
  – Veritas Enterprise Vault
  – Moonwalk
  – TigerBridge
- Enterprise file services:
  – IBM Aspera
  – Ctera
  – Nasuni
  – Panzura
- Enterprise content management:
  – IBM Filenet
  – IBM Content Manager On Demand

- Other:
  – IBM Spectrum Scale
  – Apache Spark
  – Merge Healthcare
  – Splunk
  – Nice

Tip: Most applications that support the S3 API can use IBM COS for storage.

For more information, see IBM Cloud Object Storage.

Differences between block, file, and Object Storage

The main difference between block, file, and Object Storage is who accesses the data:

- Block storage
  Block storage is visible to operating systems or hypervisors that are running on a bare metal server. Operating systems write blocks of data on to disk tracks and sectors.
- File storage
  File storage is often visible directly to users in the form of a directory by way of the SMB or NFS storage protocol. Users must decide and know where to store files and remember where to find them.
- Object Storage
  Object Storage is accessed directly from applications by way of a RESTful API. An object is stored in a flat namespace with all other objects in the same namespace. An object name is used to write and read objects from Object Storage.
  Object Storage provides the capability to add custom metadata to application data.

Figure 2 shows the differences between block, file, and Object Storage.

Figure 2 Differences between block, file, and Object Storage

Use cases for Object Storage

Typical application use cases for IBM COS across industries include the following examples:

- Analytics, artificial intelligence, and machine learning data repository; for example, Hadoop and Spark data lakes.
- IoT data repository; for example, sensor data collection for autonomous driving.
- Secondary storage; for example:
  – Active archive: Tiering of inactive data from primary NAS filers.
  – Storage for backup data: Leading backup applications have native integration with Object Storage for longer term retention purposes.
- Storage for cloud native applications: Object Storage is the de facto standard for cloud native applications.

Industry-specific use cases for IBM COS include the following examples:

- Healthcare and Life Sciences:
  – Medical imaging, such as picture archiving and communication system (PACS) and magnetic resonance imaging (MRI)
  – Genomics research data
  – Health Insurance Portability and Accountability Act (HIPAA) of 1996 regulated data
- Media and entertainment; for example, audio and video
- Financial services; for example, regulated data that requires long-term retention or immutability

For information about use cases, see IBM Cloud Object Storage System Product Guide, SG24-8439.

Flexible deployment options of IBM Cloud Object Storage

IBM COS is available in the following modes:

- On-premises Object Storage:
  – IBM hardware appliances with IBM COS software
  – IBM certified third-party x86 servers with IBM COS software
- Public Cloud Object Storage (multi-tenant)

Figure 3 shows the various deployment options for IBM COS.

Figure 3 Deployment options for IBM COS

Note: IBM COS Software is available in several licensing models, including perpetual, subscription, or consumption.

This IBM Redpaper publication explains the architecture of the IBM Cloud Object Storage on-premises offering and the technology behind the product. For more information about the IBM Cloud Object Storage use case scenarios and deployment options, see IBM Cloud Object Storage System Product Guide, SG24-8439.

For more information about the IBM Cloud Object Storage public cloud offering, see the following publications:

- Cloud Object Storage as a Service: IBM Cloud Object Storage from Theory to Practice, SG24-8385
- How to Use IBM Cloud Object Storage When Building and Operating Cloud Native Applications, REDP-5491

IBM Cloud Object Storage architecture

IBM COS is a dispersed storage system that uses several storage nodes to store pieces of the data across the available nodes. IBM COS uses an Information Dispersal Algorithm (IDA) to break objects into encoded and encrypted slices that are then distributed to the storage nodes.

No single node has all of the data. This configuration makes it safe and less susceptible to data breaches while needing only a subset of the storage nodes to be available to retrieve the stored data. This ability to reassemble all the data from a subset of the slices dramatically increases the tolerance to node and disk failures.

The IBM COS architecture is composed of the following functional components. Each of these components runs IBM COS software that can be deployed on certified, industry standard hardware:

- IBM Cloud Object Storage Manager
  IBM Cloud Object Storage Manager provides a management interface that is used for administrative tasks, such as system configuration, storage provisioning, and monitoring the health and performance of the system.
  The Manager can be deployed as a physical appliance, VMware virtual machine, or Docker container.
- IBM Cloud Object Storage Accesser node
  IBM Cloud Object Storage Accesser node encrypts and encodes data on write and decodes and decrypts it on read. It is a stateless component that presents the storage interfaces to the client applications and transforms data by using an IDA.
  The Accesser node can be deployed as a physical appliance, VMware virtual machine, Docker container, or can run as an embedded Accesser node on the IBM Slicestor appliance.
- IBM Cloud Object Storage Slicestor node
  The IBM Cloud Object Storage Slicestor node is responsible for storing the data slices. It receives data from the Accesser node on write and returns data to the Accesser node as required by reads. The Slicestor also ensures the integrity of the saved data and rebuilds if necessary.
  Slicestor nodes are deployed as physical appliances.

Figure 4 shows a simple architecture layout of the different components in IBM COS.

Figure 4 IBM Cloud Object Storage architecture

S3 interface: IBM COS uses the S3 interface for all storage operations; for example:

- PUT: Writes an object to the storage.
- GET: Reads an object from the storage.
- DELETE: Deletes an object from storage.
- LIST: Lists objects that are in a bucket.

All API calls are issued against an IBM COS Accesser node.

Core concepts

This section provides information about IBM COS core concepts. Figure 5 shows the major IBM COS logical concepts.

Figure 5 IBM Cloud Object Storage logical concepts

Device sets

IBM COS uses the concept of device sets to group Slicestor devices (see Figure 6). Each device set consists of several Slicestor devices.

Figure 6 Device set: A set of Slicestor devices

Device sets can be spread across one or multiple data centers. All Slicestor nodes in one device set must have the same configuration (Slicestor node model, number of drives, and drive size).

Storage pools

A storage pool consists of one or more device sets that can be spread across multiple data centers, as shown in Figure 7.

Figure 7 IBM Cloud Object Storage storage pools

Device sets in a storage pool can have different configurations. This configuration enables adding newer Slicestor nodes to a system without replacing older Slicestor nodes.

Note: Storage pool expansion must follow specific rules. For more information, see IBM Cloud Object Storage System Product Guide, SG24-8439.

Vaults

Vaults are logical storage containers for data objects that are contained in a storage pool, as shown in Figure 8.

Important: A vault in IBM COS features the same functionality as an S3 bucket.

Figure 8 IBM Cloud Object Storage vault

Vaults are deployed on a storage pool and automatically spread across all the device sets. One or more vaults can be deployed to a storage pool.

Mirrored vaults

A vault that is on one storage pool can be mirrored to a vault on another storage pool, commonly in a different location. Both component vaults are controlled by a mirror, and storage operations are issued against the mirror. All objects in the mirror are available on both vaults. This concept is usually seen in a two-site deployment, but can be used for other use cases, such as a hub-and-spoke design.

Recommended practice: Although vaults in a mirrored configuration can have different IDAs and protection settings, it is recommended to have the same usable capacity on both storage pools in a two-site configuration.

A mirrored setup across two different sites protects the IBM COS system against a site failure. If one site is unavailable, reads and writes are served from the available vault automatically. A failover procedure is not required if the application can reach a functioning Accesser node at either site. A failback procedure is not required when the site comes back online.

Access pools

An access pool consists of one or more Accesser nodes, which present a vault to an application. More than one access pool can be used to separate traffic or restrict access to certain vaults. This way, tenant separation can be implemented.

The connection between access pools and vaults is a many-to-many connection. One vault can be deployed on many access pools, and one access pool can have more than one vault deployed.
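The many-to-many relationship between access pools and vaults can be pictured with a small data model. The following is only an illustrative sketch: the class names, field names, and example pool and vault names are hypothetical and do not correspond to any IBM COS API or Manager object.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPool:
    """Hypothetical model of an access pool: Accesser nodes plus deployed vaults."""
    name: str
    accesser_nodes: list[str]
    vaults: set[str] = field(default_factory=set)


def pools_serving(vault_name: str, access_pools: list[AccessPool]) -> list[str]:
    """Return the names of all access pools on which the given vault is deployed."""
    return sorted(p.name for p in access_pools if vault_name in p.vaults)


# Two tenants with separate access pools; one shared vault is deployed on both.
tenant_a = AccessPool("pool-a", ["accesser-1", "accesser-2"], {"vault-a", "vault-shared"})
tenant_b = AccessPool("pool-b", ["accesser-3"], {"vault-b", "vault-shared"})
pools = [tenant_a, tenant_b]
```

In this sketch, a client that can reach only the Accesser nodes of `pool-b` never sees `vault-a`, which mirrors how separate access pools can restrict access to certain vaults.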

Information Dispersal Algorithm

The Information Dispersal Algorithm (IDA) is based on erasure coding and defines the reliability, availability, and storage efficiency of an IBM COS system. The IDA is defined at the vault level at the vault creation time. The IDA is written as width/read threshold/write threshold; for example: 12/6/8.

The IDA consists of the following components:

- Width: The width of the IDA is the total number of slices that is generated by erasure coding. For example, in a 12-wide storage pool all data has 12 slices.
- Read threshold: The read threshold of an IDA defines the number of slices of the width that must be available for the data to be readable. For example, if the read threshold of a 12-wide system is set to 6, the system needs only six slices to read the data.
  Tip: If the read threshold is set higher, the IBM COS system can survive fewer failures, but the storage efficiency is better.
- Write threshold: The write threshold of an IDA is the number of slices of the width that need to be written before the Accesser node returns the success to the client. The write threshold always must be higher than the read threshold so that the data is available, even if a failure occurs right after the write is completed. For example, if the write threshold of a 12-wide system is set to 8, the system must successfully write eight slices to complete a write request.
  Tip: If the write threshold is set lower, the IBM COS system can survive more failures, but the storage efficiency suffers because of the higher redundancies.
- Expansion factor: The expansion factor is calculated as the width divided by the read threshold. It also defines the ratio of raw capacity versus usable capacity. See Table 1 for examples.

Dispersal modes

IBM COS can operate in two different dispersal modes, as shown in Figure 9.

Figure 9 Dispersal modes in IBM COS

In Standard Dispersal Mode (SD Mode), which is also called non-Concentrated Dispersal Mode, each slice is written on a different Slicestor node. This mode ensures the highest performance and availability because one Slicestor node down means that only one slice is

unavailable. SD Mode is usually used in larger configurations and is supported on systems with at least 12 nodes.

Note: SD Mode allows you to configure width, read, and write thresholds. For more information about the IDA configuration guidelines, see IBM Cloud Object Storage System Product Guide, SG24-8439.

In Concentrated Dispersal Mode (CD Mode), multiple slices of a single object segment are placed on a single Slicestor node, but never on the same disk. This mode enables cost-efficient smaller systems starting from 72 terabytes to a few petabytes. If one Slicestor node goes down, more slices become unavailable. CD Mode is supported on systems starting with three Slicestor nodes.

Note: CD Mode defines preconfigured IDAs that are optimized for storage or performance.

Location options

IBM COS can be deployed in one or more sites. A single-site setup does not protect against a site failure, although it does provide the lowest possible overhead and the best latency. Two sites are typically set up as a mirrored configuration. IBM COS realizes its full advantages in a geo-dispersed setup of at least three sites. Slicestor nodes are distributed across multiple sites for reliability and availability. In a geo-dispersed setup, IBM COS relies on a single copy of data that is protected by way of erasure coding against site failures. The options are shown in Figure 10.

Figure 10 IBM Cloud Object Storage location options

The nodes of a single IBM COS system can be spread across distances of thousands of kilometers if the round-trip latency between nodes does not exceed 100 milliseconds.

Typical expansion factors are listed in Table 1.

Table 1 Typical expansion factors for IBM COS

Number of sites | Typical expansion factor range
----------------|-------------------------------
Single site | 1.3 - 1.5
Two sites | 2.4 - 2.8
Three sites | 1.8 - 2.0
More than three sites | 1.4 - 1.8

Container mode

The default for an IBM COS system is vault mode, which is suitable for most customer deployments. For systems that require thousands or millions of buckets or tenants, IBM COS can be deployed in container mode.

Note: The general term for a logical storage unit in S3 is a bucket. In vault mode, a bucket is referred to as a vault. In container mode, a bucket is referred to as a container.

Container mode provides the following capabilities to the IBM COS system:

- Support for millions of buckets
- Support for millions of users
- Self-provisioning capability for service by using RESTful API
- Support for billing users based on usage
- Isolation of objects between users and tenants

Recent software versions introduced several enhancements for container mode, including the Storage Account Portal and S3 versioning. These options are discussed later in this section.

For more information on container vaults in ClevOS version 3.15.6 and later, see the Manager Administration Guide and the Container Mode Guide:

- mode-container-guide
- vaults-container
- vaults-configure-container-mode

For older releases, see the following documents:

- IBM Cloud Object Storage System Container Mode Guide
- IBM Cloud Object Storage System Container Mode Feature Description Document

Notes: Consider the following points:

- Vaults still exist in container mode and are defined by the operator as container vaults. Individual users can create containers within those container vaults. Some properties of the containers, such as the IDA, are inherited from the container vault.
- IBM COS allows operation in mixed mode, which is the creation of both standard and container vaults on the same system.
- Container vaults do not support mirroring (vault mirrors). Consider one- or three-site deployments when container mode is required.

The main features of vault and container mode are compared in Table 2.

Table 2 Vault and container mode comparison

Feature | Vault mode | Container mode
--------|------------|---------------
Maximum number of buckets | 1,000 vaults (1,500 with supported hardware and system configuration) | Millions of containers
Maximum number of users | Thousands | Millions
Bucket, user, permission management | Via GUI or REST API on the Manager node | Via Service API on the Accesser nodes or Storage Account Portal in the Manager interface (ClevOS 3.15.4)
User authentication for the bucket | S3 Access Key ID/Secret Access Key pair or username/password | S3 Access Key ID/Secret Access Key pair
Bucket deletion authorization | Yes | No
Bucket quota support | Yes | Yes
Bucket firewall support | Yes (IP whitelist) | Yes (IP allow/disallow)
Bucket indexing options | Name index/Recovery listing only/No index | Name index
Bucket tagging support | Yes | No
Object tagging support | Yes | Yes
S3 versioning | Yes | Yes (ClevOS 3.15.7)
Retention enabled buckets, legal hold | Yes | Yes
Delete restricted vaults | Yes | No
Mirroring (2 site data replication) | Yes | No
S3 proxy and migration support | Yes | No
Accesser deployment options | Physical, embedded, virtual machine, Docker container, application | Physical or virtual machine, Docker container
Manager deployment options | Physical, virtual machine, Docker container | Physical server or virtual machine, Docker container (ClevOS 3.15.1)
Supported dispersal modes | Standard and Concentrated Dispersal Mode | Standard Dispersal Mode and Concentrated Dispersal Mode (ClevOS 3.15.1)

Storage Account Portal

The Storage Account Portal, which was introduced in ClevOS v3.15.4, enables management of storage accounts, credentials, and containers from the IBM COS Manager's graphical user interface in container mode. A new user role, the Storage Account Administrator, is introduced within the IBM COS Manager. The Storage Account Administrator grants specific users access to the portal. In the Storage Account Portal, the administrator performs the following tasks:

- For storage accounts:
  – Create
  – Edit
  – Delete
  – List
  – Lookup
- For credentials (Access Key ID/Secret Access Key pairs):
  – Create
  – Delete
  – List
  – Lookup
- For containers:
  – Create
  – Delete
  – List
  – Lookup

Note: In previous ClevOS versions, clients that deployed container mode needed their own solution, such as a portal or script-based procedures, for managing accounts, credentials, and buckets by using the Service API.

Currently, some container properties cannot be adjusted from the graphical user interface, and modifying them still requires the Service API. For example:

- Hard quota settings
- IP allow/disallow (firewall) configurations
- Kafka notifications configurations

The portal also contains a Usage Metrics section, which allows the administrator to generate reports on the following:

- Current storage usage by container
- Aggregated storage usage over a specific period of time by a storage account
- Daily historical usage over a specified date range by container

These reports capture the storage usage as follows:

- In units of bytes and objects for current usage
- In units of byte-hours and object-hours for aggregated or historical usage

The report can be exported in CSV, JSON, and XML formats. The Service API has been extended to allow for custom historic usage queries when the existing export options are not sufficient.

Note: For more information on the Container Mode Service API, see:

- latest/kc pdf files.html, Table 3 Developer Guides
- mode-using-service-apis-manger-rest-api

S3 versioning

ClevOS 3.15.7 introduces S3 Object Versioning for containers in container mode. This feature is already available for standard vaults. Versioning can be enabled or disabled on a per bucket basis by using S3 API calls.

Slight differences between the standard vault and the container vault implementation must be considered when moving from a vault to a container with versioning enabled:

1. Limit on the number of versions for an object:
   – Container vault: No limit
   – Standard vault: Maximum of 1000 versions
2. If an object with a versionID of null exists when versioning is disabled or suspended, then:
   – In a container vault, this null version is overwritten when the object is modified.
   – In a standard vault, the null version is saved with a new versionID on a subsequent overwrite.

Note: More information on the changes regarding S3 object versioning can be found in the release notes of ClevOS 09ogz/2/IBM Release Notes 3.15.7.pdf

IBM Cloud Object Storage read and write operations

Data import and retrieval for IBM COS is achieved by using standard S3 (HTTP) PUT and GET commands. In the background, the data goes through several transformation steps on the IBM COS Accesser node before it is sent to the Slicestor nodes for storage. The process of reading and writing data to IBM COS is described next.

Writing data to IBM Cloud Object Storage

As shown in Figure 11, client applications use the S3 API interface to write objects to IBM COS through an Accesser node. The locations of the client application, Accesser node, and Slicestor nodes are independent; all can be at the same site or at different sites. The process of writing data includes the following steps:

1. The client application issues a write request and sends the data to an Accesser node.
2. The Accesser node segments the object.
3. The Accesser node encrypts and performs the erasure coding of the data.
4. The system stores data on the Slicestor nodes.
5. The Accesser node acknowledges the write to the client application.

Figure 11 Client application sends the data to an IBM Cloud Object Storage Accesser node

Segmentation

If the object is larger than 4 MiB, the Accesser node splits the data into 4 MiB segments for optimal performance. For example, a 1 GiB object is split into 256 segments of 4 MiB each, as shown in Figure 12.

Figure 12 IBM Cloud Object Storage Accesser node creating 4 MiB segments

Note: Smaller objects (less than 1 MiB) are stored together in bin files. This configuration enables better space management and faster read, write, and list operations.

Data-at-rest encryption and erasure coding

The individual segments go through a transformation that enables the system to store the data reliably and securely. This transformation includes two main steps:

1. Data-at-rest encryption with SecureSlice.
2. Erasure coding and data dispersal with the Information Dispersal Algorithm.

SecureSlice

SecureSlice uses an all-or-nothing-transform (AONT) to encrypt the data. AONT is a type of encryption in which the information can be deciphered only if all the content is known. Figure 13 shows the IBM COS SecureSlice AONT encryption process:

1. Add an integrity check value to the segment.
2. Generate a random encryption key.
3. Encrypt the data by using the encryption key.
4. Calculate the hash of the encrypted data.
5. Calculate the exclusive-OR (XOR) of the hash and the encryption key.

6. Append the result to the encrypted data to create the AONT package.

Figure 13 IBM Cloud Object Storage SecureSlice AONT encryption

Notes: Consider the following points:

- Integrity check values are always added to the data segments, but encryption is an optional feature. Encryption is enabled by default, and can be disabled on a per vault basis in the IBM COS Manager at the time of vault creation.
- IBM COS software version 3.14.6 introduced the following changes to the implementation and configuration of data-at-rest encryption with SecureSlice:
  – A new encryption algorithm was added, AES-GCM-256, which provides up to 15% performance improvement compared to the existing default AES-128.
  – After upgrading to 3.14.6 or later, AES-GCM-256 becomes the default and the recommended encryption algorithm for the vaults.
  – Administrators now can choose the preferred encryption algorithm for new vaults and they can also change the algorithm for the existing vaults.
  – If the encryption settings are changed for a vault, the change is applied to only the newly written objects. Existing objects are not reencrypted, and they do not benefit from the improved performance and better security that is provided by the new algorithm.
  – It is now possible to enable or disable SecureSlice for existing vaults at any time.
  – All changes that are made to the encryption settings generate audit log messages for tracking purposes.

Information Dispersal Algorithm

In the next step, erasure coding is performed on the AONT package and data is distributed to the Slicestor nodes.

Note: In general, erasure coding transforms a message of k symbols into a longer message with n symbols such that the original message can be recovered from a subset of the n symbols (only k symbols are needed to reconstruct the data). In the case of IBM COS erasure coding, k is always the read threshold and n is the width.
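Before moving on to the erasure coding steps, the six SecureSlice packaging steps listed before the notes above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the IBM COS implementation: the integrity check value is assumed to be a SHA-256 digest of the segment, the hash is assumed to be SHA-256, and the cipher is a stand-in XOR keystream built from SHA-256 rather than AES. The unpackaging function is likewise an inference from how the package is built; the function names are hypothetical.

```python
import hashlib
import secrets

KEY_LEN = 32  # same length as a SHA-256 digest, so key and hash can be XORed


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _stand_in_cipher(key: bytes, data: bytes) -> bytes:
    """Assumption: XOR keystream derived from SHA-256; NOT AES, illustration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(_xor(data[offset:offset + 32], block))
    return bytes(out)


def aont_package(segment: bytes) -> bytes:
    check = hashlib.sha256(segment).digest()             # 1. integrity check value (assumed form)
    key = secrets.token_bytes(KEY_LEN)                   # 2. random encryption key
    encrypted = _stand_in_cipher(key, segment + check)   # 3. encrypt the data
    digest = hashlib.sha256(encrypted).digest()          # 4. hash of the encrypted data
    masked_key = _xor(digest, key)                       # 5. XOR of hash and key
    return encrypted + masked_key                        # 6. append to form the package


def aont_unpackage(package: bytes) -> bytes:
    """Recover the segment; requires the complete package (all-or-nothing)."""
    encrypted, masked_key = package[:-KEY_LEN], package[-KEY_LEN:]
    key = _xor(hashlib.sha256(encrypted).digest(), masked_key)
    plain = _stand_in_cipher(key, encrypted)
    segment, check = plain[:-32], plain[-32:]
    if hashlib.sha256(segment).digest() != check:
        raise ValueError("integrity check failed")
    return segment
```

Because the key can be recovered only by hashing the entire encrypted payload, a party that holds fewer bytes than the whole package cannot derive the key, which is the property that the all-or-nothing description refers to.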

The IBM COS erasure coding process includes the following steps:

1. The AONT package is sliced to read threshold number of slices.
2. Erasure coding creates encoded slices so that the total number of slices (data slices plus encoded slices) is equal to IDA width.
3. Data is sent to the Slicestor nodes.

Figure 14 shows the main steps of the data transformation.

Figure 14 Data transformation in IBM COS

Storing the data on the Slicestor nodes

After erasure coding, data is distributed to the Slicestor nodes. The storage pool and the vault IDA configuration determine how many Slicestor nodes are used to store the data for a specific segment.
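The arithmetic behind an IDA such as 12/6/8 can be worked through with a short sketch. The expansion factor formula (width divided by read threshold), the rule that the write threshold is higher than the read threshold, and the 4 MiB segment size come from this paper; the class and function names are hypothetical, and the upper bound of the write threshold at the width is an assumption.

```python
import math
from dataclasses import dataclass

SEGMENT_SIZE = 4 * 1024 * 1024  # 4 MiB segments, as described under Segmentation


@dataclass(frozen=True)
class IDA:
    width: int            # n: total slices per segment
    read_threshold: int   # k: slices needed to read
    write_threshold: int  # slices that must be written before acknowledging

    def __post_init__(self):
        # Assumption: the write threshold cannot exceed the width.
        if not (0 < self.read_threshold < self.write_threshold <= self.width):
            raise ValueError("expected 0 < read < write <= width")

    @property
    def expansion_factor(self) -> float:
        return self.width / self.read_threshold

    @property
    def unavailable_slices_tolerated_for_read(self) -> int:
        return self.width - self.read_threshold

    @property
    def unavailable_slices_tolerated_for_write(self) -> int:
        return self.width - self.write_threshold


def segment_count(object_size: int) -> int:
    """Number of 4 MiB segments for an object (at least one)."""
    return max(1, math.ceil(object_size / SEGMENT_SIZE))


def raw_capacity(usable_bytes: float, ida: IDA) -> float:
    """Raw capacity consumed for a given amount of usable data."""
    return usable_bytes * ida.expansion_factor


ida = IDA(width=12, read_threshold=6, write_threshold=8)
```

With the 12/6/8 example, the expansion factor is 2.0, data stays readable with up to six slices unavailable, and writes can still be acknowledged with up to four slices unavailable.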

Writing data in Standard Dispersal Mode

If the storage pool is configured to use SD Mode, each Slicestor node in a device set stores a single slice, as shown in Figure 15.

Figure 15 Writing to a Standard Dispersal Mode storage pool

Writing data in Concentrated Dispersal Mode

If the storage pool is configured to use CD Mode, each Slicestor node stores multiple slices. The system ensures that the slices are not stored on the same drives within the same chassis. This configuration ensures high reliability and that a single drive failure does not cause loss of multiple slices. Figure 16 shows writing to a CD Mode storage pool.

Figure 16 Writing to a Concentrated Dispersal Mode storage pool
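The difference between the two dispersal modes can be illustrated with a simple placement sketch. The round-robin strategy, the function name, and the example width and node counts are assumptions chosen for illustration; this paper does not describe the actual placement logic. The only properties taken from the text are that SD Mode places one slice per Slicestor node and that CD Mode places several slices per node but never two on the same drive.

```python
import math


def place_slices(width: int, nodes: list[str], drives_per_node: int) -> list[tuple[str, int]]:
    """Assign each of `width` slices to a (node, drive index) pair, round-robin.

    With len(nodes) >= width, every node holds at most one slice (SD Mode behavior).
    With fewer nodes, each node holds several slices on distinct drives
    (CD Mode behavior).
    """
    slices_per_node = math.ceil(width / len(nodes))
    if slices_per_node > drives_per_node:
        raise ValueError("not enough drives to keep slices on separate drives")
    # Slice i goes to node i mod N; each further pass over the nodes uses the next drive.
    return [(nodes[i % len(nodes)], i // len(nodes)) for i in range(width)]


# SD Mode style: 12 slices over 12 nodes, one slice per node.
sd_layout = place_slices(12, [f"slicestor-{i}" for i in range(12)], drives_per_node=12)

# CD Mode style: 12 slices over 3 nodes, 4 slices per node on 4 different drives.
cd_layout = place_slices(12, ["slicestor-a", "slicestor-b", "slicestor-c"], drives_per_node=12)
```

In the CD Mode style layout, losing one node removes four slices at once, whereas losing one drive removes only one slice, which matches the reliability trade-off described above.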

Writing data to a mirrored vault

In the case of a mirrored vault, the Accesser node sends the slices to both component vaults, which are always on different storage pools and typically on different sites. Figure 17 shows a write operation to a mirrored vault in CD Mode. A write operation to an SD Mode-based vault mirror works similarly, but only one slice is written to every Slicestor node.

Figure 17 Writing data to a mirrored vault

Successful write operations

The write operation is successful and acknowledgment is sent back to the client when the minimum write threshold number of slices are written to the storage pool. The Accesser node continues to write the remaining slices to the Slicestor nodes asynchronously. If any of the slices cannot be written (for example, when a Slicestor node is unavailable), the distributed rebuilder process (which is running on the Slicestor nodes) automatically re-creates the missing slices.

For mirrored vaults, the following modes are available when the mirror is created:

- Asynchronous (default setting)
  The Accesser node sends acknowledgment to the client application after the write operation is confirmed on either side of the mirror.
- Synchronous
  The Accesser node sends acknowledgment to the client application after the write operation completes on both sides of the mirror, but both sides do not have to succeed in writing the data.
- Hard Synchronous
  The Accesser node sends acknowledgment to the client application after both sides of the mirror confirmed the successful write.
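The three mirror modes differ only in when the acknowledgment may be sent. The following sketch models that decision. It is an interpretation, not product code: each side of the mirror is represented as `None` (still in progress), `True` (write succeeded), or `False` (write failed), and for Synchronous mode it is assumed that at least one side must have succeeded once both sides have completed. The enum and function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class MirrorMode(Enum):
    ASYNCHRONOUS = "asynchronous"
    SYNCHRONOUS = "synchronous"
    HARD_SYNCHRONOUS = "hard synchronous"


def can_acknowledge(mode: MirrorMode, side_a: Optional[bool], side_b: Optional[bool]) -> bool:
    """Decide whether the Accesser node may acknowledge the write to the client."""
    sides = (side_a, side_b)
    if mode is MirrorMode.ASYNCHRONOUS:
        # Confirmed on either side of the mirror.
        return any(s is True for s in sides)
    if mode is MirrorMode.SYNCHRONOUS:
        # Both sides completed; assumption: at least one of them succeeded.
        return all(s is not None for s in sides) and any(s is True for s in sides)
    # Hard Synchronous: both sides confirmed a successful write.
    return all(s is True for s in sides)
```

For example, with one side confirmed and the other still in progress, only Asynchronous mode acknowledges; with one side confirmed and the other failed, Asynchronous and Synchronous acknowledge but Hard Synchronous does not.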

Figure 18 shows more information about each mirror mode's advantages and disadvantages.

[Figure 18, recoverable text: Asynchronous. Response is returned to the user once the write operation has been confirmed on either side of the mirror. Async writes: the request is sent to both sides of the mirror and needs only one side of the mirror to respond.]
