
Whitepaper

Capacity Operations: Continuous Compute Optimization for Cloud & Container Environments

Andrew Hillier, Co-Founder & CTO, Densify
July 2, 2021

What is Capacity Operations?

Capacity Operations, or "CapOps," is the emerging discipline of continuously optimizing compute resources in cloud and container environments. It fills a gap that has emerged between the DevOps and FinOps processes, where in-depth analysis of the ongoing resource requirements of cloud and container-based applications typically isn't performed by either group, leading to inflated bills and unnecessary operational risk. And although it is driven by the same general goals, CapOps differs from traditional capacity management in that the focus is less on long-term planning to make sure there is enough "on the floor," and more on continuous alignment of application demand and infrastructure supply in elastic, "as-a-service" environments.

Background

Since the dawn of computing, there has been a need to ensure that IT environments have sufficient resources to meet application demand, without having too much. As the industry progressed from mainframes, to midrange, to open systems, to virtual environments, the practice of managing capacity evolved with it, ultimately resulting in a highly mature discipline designed to minimize the risk of running out of resources while at the same time ensuring that over-purchasing is avoided. Specialized activities such as demand management, risk management, and predictive forecasting all contributed to the smooth and efficient operation of IT environments.

The advent of cloud computing created a disruption in many areas of IT, and the management of capacity was no exception. The ability to purchase resources "on demand" eliminated the need for long-term planning of hardware purchases, and also greatly reduced the risk of running out of resources. This caused the pendulum to swing away from capacity management and toward the bill, causing many capacity teams to be disintermediated in the process. The newfound ability to see costs broken down in extreme detail gave rise to a new focus, and a new breed of tooling, designed to understand, allocate, and minimize costs.

[Figure: The evolution from capacity management in virtual environments (VM placement, resource allocation, demand management), through a resource optimization gap focused on the bill in the cloud (instance catalog selection, optimizing elasticity, micro-purchasing), to capacity operations for containers (highly granular resource requests; pods, ReplicaSets, deployments; node optimization), with optimization complexity growing with the number of managed entities. Caption: "The shift to public cloud has created a blind spot for organizations where the actual resources being consumed are not being optimized, inflating bills and creating operational risk."]

But focusing on allocating costs and purchasing discounts to minimize the bill will only get you so far, and in many cases a high cloud bill is just a symptom of a deeper underlying resource problem. If applications are configured to use the wrong resources, and if the elastic structures in the cloud are not working efficiently, then no amount of discounting will claw back the extra cost.

To truly optimize the efficiency of these environments, while at the same time ensuring performance requirements are met, the pendulum needs to swing back, and a more disciplined approach must be taken to optimizing the resources in use.

Enter Capacity Operations

The logical path for this to take mirrors what has happened in the areas of development and financial management. Application development has become far more agile, and has effectively merged with certain aspects of operations to become DevOps. This mashup of an offline activity (development) and an online practice (operations) helped evolve application delivery into a much more agile, elastic, and collaborative process. The same shift is also happening to financial optimization, where the offline practice of financial management is becoming more operational, producing a much more dynamic FinOps practice that is capable of keeping up with dynamic cloud and container environments.

But both of these disciplines have practical limitations. DevOps has the mandate to "deliver applications and services at high velocity," but typically doesn't include detailed analysis and optimization of the resources used by those applications, either when they are initially deployed or after they have been running. These teams are too busy focusing on new features and time-to-market, as they should be, since they are uniquely able to control this.

Similarly, FinOps has the mandate of "cloud financial operations," and is the formalization of the various financial practices surrounding cloud. And although optimizing resources has a significant impact on the financial picture, FinOps teams typically do not have the tooling, subject matter expertise, or bandwidth to delve deeply into detailed resource utilization, optimizing elasticity, sizing containers, or other highly granular activities.

Following this pattern, the logical evolution of capacity is for it to transition from an offline practice (planning, management) to an online, more operational discipline. The resulting "CapOps" practice can be considered to have the mandate of "continuous resource optimization," and by refocusing on the new, more dynamic capabilities of cloud and container infrastructure, it can bring back the discipline that was temporarily lost. This allows organizations to once again ensure that there are "sufficient resources to meet application demand, without having too much," filling the gap left by the evolution of the DevOps and FinOps practices.

[Figure: Where CapOps fits. FinOps (cloud financial operations) covers bill visibility, chargeback, cost anomaly detection, and Reservations & Savings Plans. DevOps (high-velocity app & service delivery) covers CI/CD pipeline integration. CapOps (capacity operations) fills the resource optimization gap between them: resource & family optimization, container optimization, and action execution (approvals, Terraform/CloudFormation).]

Key Capabilities of Capacity Operations

To understand the requirements of CapOps it is useful to draw parallels to the on-prem data center hosting model, and in particular, what it means to have infrastructure "on the floor" (and how that infrastructure gets on the floor). In an on-prem, CapEx-oriented hosting model there is a long lead time for deploying new compute resources, and this drives a lot of the capacity analysis that is performed, including forecasting and demand management. Accurately modeling the pipeline of inbound demand, and ensuring resources are available to meet it, can prevent unnecessary risk, and making sure those resources are used efficiently can prevent costly purchases.

Cloud infrastructure, on the other hand, enables you to deploy resources "on the floor" in minutes, or even seconds, through API calls or lines of code (such as Terraform or CloudFormation). This highly elastic "micro-purchasing" model is a key advantage of the cloud, and eliminates the need for many of the planning-oriented capacity management activities.

[Figure: Micro-purchasing through automation APIs and infrastructure as code (Terraform, AWS CloudFormation):]

    provider "aws" {
      region = "${var.aws_region}"
    }

    resource "aws_instance" "web" {
      instance_type = "m4.large"
      ami           = "${lookup(var.aws_amis, var.aws_region)}"

      tags = {
        Name = "Web Server"
      }
    }

But this doesn't mean capacity can be ignored, as many organizations initially assumed. Rather, it needs a completely different set of activities in order to optimize resources. In many cases these activities must be re-thought from the ground up, since the fundamental assumptions of traditional processes have changed. For example, even taking inventory of what is "on the floor" is very different than in on-prem environments, and now resembles a "stock chart" of ups and downs more than a static count of things on the floor. This fluidity has a ripple effect through many other areas, including capacity.

And this micro-purchasing model is a double-edged sword. While providing agility, it also puts resourcing decisions in the hands of engineers and developers who may not have sufficient information to make the right choice. In this new world, a relatively junior engineer can put a line of code in a file that causes a purchase, and although each such purchase is small, getting it wrong across many instances can result in tremendous inefficiency and significant cost. As a result, even traditional capacity management activities such as rightsizing virtual machines now need to be done in a completely different way, and must adapt to this new, decentralized form of decision-making.

Given all of this, there is a set of fundamental operations that must be performed to make sure that the right resources are deployed at any point in time. For cloud environments, this includes:

• Instance sizing (upsize, downsize) and termination
• Instance family optimization (memory optimized, CPU optimized, burstable)
• Scaling group node optimization (node type, size)
• Scaling group scaling parameter optimization (elasticity)
• DB-as-a-service optimization

Only one of these operations, instance sizing, resembles something that is done in traditional virtual environments, but even this must be done very differently. As mentioned above, the resources in use are now typically specified in manifests, or "infrastructure as code," and any optimization must be embedded in these manifests, with automation occurring through the deployment pipeline. This is very different from virtual environments, where automation typically involves modifying the VMs directly. That approach will not work in environments that leverage infrastructure as code, as the running instances will always revert back to what the code says.

Instead, the optimization recommendations must also become lines of code to enable continuous optimization, effectively creating "optimization as code."

[Figure: "Optimization as code" through automation APIs and infrastructure as code (Terraform, AWS CloudFormation):]

    provider "aws" {
      region = "${var.aws_region}"
    }

    resource "aws_instance" "web" {
      # instance_type = "m4.large"
      instance_type = "${aws_instance.tags:Densify-optimal-instance-type}" # illustrative lookup of the recommended type
      ami           = "${lookup(var.aws_amis, var.aws_region)}"

      tags = {
        Name = "Web Server"
      }
    }

Beyond instance sizing, the rest of the operations are new. Clouds use catalog-based sizing, and optimization analysis must determine not only the correct instance size, but also the optimal instance family for a given workload, which can be complex. Even within a given family, there may be newer instance types available that are faster, cheaper, or both, and modernizing to these new instance types can be a quick way to gain efficiency.

Building on this, the optimization of scaling groups also benefits from instance-level optimization, as it is common for there to be a mismatch between the resources being consumed by the applications and those being provisioned in the scaling groups. Scaling groups also enable optimization of the scaling parameters themselves, to ensure that groups scale up when needed, and down when not. Optimizing these settings enables organizations to configure cloud infrastructure to respond dynamically to load in an optimal manner, something that is not possible in legacy environments. This scaling group optimization is becoming increasingly important as organizations move to containers: container clusters typically run on auto scaling groups, and not only is container performance highly dependent on those groups scaling properly, but the costs of the container environment are also reflected in the scaling group costs.
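To make these scaling group operations concrete, the following is a minimal Terraform sketch. The resource names, variable defaults, AMI and subnet inputs, and the 60% utilization target are illustrative assumptions, not values from this paper; the point is that both the node type and the elasticity parameters are themselves lines of code that an optimization process can rewrite.

    # Hypothetical variables that an optimization engine could rewrite in place.
    variable "node_instance_type" {
      type    = string
      default = "m5.large" # a recommendation might change this to, e.g., "r5.large"
    }

    variable "asg_min_size" {
      type    = number
      default = 2
    }

    variable "asg_max_size" {
      type    = number
      default = 10
    }

    variable "node_ami" {
      type = string # assumed to be supplied by the caller
    }

    variable "subnet_ids" {
      type = list(string) # assumed to be supplied by the caller
    }

    resource "aws_launch_template" "web_nodes" {
      name_prefix   = "web-nodes-"
      image_id      = var.node_ami
      instance_type = var.node_instance_type
    }

    resource "aws_autoscaling_group" "web" {
      min_size            = var.asg_min_size
      max_size            = var.asg_max_size
      vpc_zone_identifier = var.subnet_ids

      launch_template {
        id      = aws_launch_template.web_nodes.id
        version = "$Latest"
      }
    }

    # Elasticity: scale out under load and back in when it subsides.
    resource "aws_autoscaling_policy" "cpu_target" {
      name                   = "cpu-target-tracking"
      autoscaling_group_name = aws_autoscaling_group.web.name
      policy_type            = "TargetTrackingScaling"

      target_tracking_configuration {
        predefined_metric_specification {
          predefined_metric_type = "ASGAverageCPUUtilization"
        }
        target_value = 60 # illustrative utilization target
      }
    }

Because the node type and scaling bounds live in code, a recommendation becomes a small, reviewable change (for example, a pull request updating node_instance_type) that the deployment pipeline applies; changing the group directly in the console would simply be reverted the next time the code is applied.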

CapOps for Containers

If dealing with the granularity of purchasing in cloud environments creates a resource challenge, then the operational model of containers takes this to an entirely new level. Containers can be even more dynamic, and far more granular, often creating an order of magnitude more entities that must be optimized. In many ways it is like transitioning from VM-level management to process-level management, where each individual workload, such as a web server or queue manager, must be assigned specific resources. To make things even more complicated, these containers can be combined into pods, ReplicaSets, deployments, and other structures, which can be launched from a single manifest, and all of these structures can be governed by various quotas to control resource usage.

And as with the cloud, these characteristics are both a blessing and a curse. Containers have undeniable benefits when it comes to the flexibility and agility they provide when deploying new applications and services. But many people mistakenly believe that containers will magically optimize themselves when it comes to resources. This is not the case, and providing inaccurate resource specifications can actually lead to tremendous inefficiency, with resources being stranded and node utilization very low.

Part of this misconception stems from the fact that containers don't overcommit resources in the same way virtual environments do, meaning that cluster administrators cannot simply tune overcommit ratios to get higher density. The resources assigned to containers are not virtual resources at all; they are actual resources, meaning they cannot be given out to multiple consumers at the same time.

This removes a key weapon in the battle against inefficiency, and any over-specification of resources translates directly into the need for more infrastructure, either on-prem or in the cloud, directly impacting cost.

There are also a number of other misconceptions when it comes to containers. Because containers are very small, many assume that poor resource specifications couldn't possibly cause high costs, since each container is so insignificant. But if thousands of poorly configured containers are running, this can add up to a tremendous amount, and this is often observed to be the case when analyzing container environments. Similarly, if containers or microservices run for a very short length of time, it is often assumed that this is relatively harmless. But again, if the services run thousands of times, then the error adds up. This "fallacy of insignificance," combined with the impractical amount of effort it would take to manually optimize each container, causes many container environments to be very inefficient.

To combat this, the more operationalized form of capacity optimization provided by CapOps also helps address the gap in container resource optimization. This includes:

• CPU request optimization: This is the amount of CPU resources (in "millicores") guaranteed to a container. The container scheduler must ensure that there are sufficient resources to meet the request values of all containers on a node, so if this value is too high (which is common), the scheduler will need to spread the containers across more nodes than is necessary, and utilization will be low.

• CPU limit optimization: This is the maximum amount of CPU a container can consume, and setting it too low will cause the scheduler to throttle the performance of the container.

• Memory request optimization: This is the amount of memory (in megabytes) allocated to a container. Like CPU, if this value is too high, then resources will be stranded, and workload density will be low.

But, unlike CPU, if this value is too low, the scheduler may end up placing too many containers on a node, and when the aggregate memory utilization of that node rises above the requested values, containers will actually be killed to free up resources. This "Out of Memory Killer" (or "OOM killer") behavior is very dangerous, and can be avoided with proper resource optimization.

• Memory limit optimization: This is the maximum amount of memory a container can consume, and if it is too low, this can also cause containers to be killed when their utilization exceeds their limit.

By performing this analysis at the container level, the results can then be associated back to the Pods, ReplicaSets, and Deployments that they are part of, enabling the optimization to be embedded in the manifests that created the containers. This provides seamless automation, and spares humans from manually optimizing thousands of containers, which simply isn't viable as container environments grow.

Of course, these containers run on nodes, and the optimization of those nodes is also critical. For containers running in cloud environments the nodes are usually cloud instances, typically running in scaling groups, and optimization equates to the cloud instance optimization described above. And because the container and node configurations affect each other, the two forms of optimization must be done together to ensure that the nodes are constantly aligned with the needs of the containers. For example, if container CPU request values are reduced, then memory will typically become the primary constraint in a cluster, and it might be necessary to transition to memory-optimized nodes to maintain efficiency. For on-premises environments, the nodes can be VMs (optimized via sizing and placement) or bare metal, requiring long-term purchase planning consistent with an on-prem capacity management practice.
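As an illustration of the request and limit settings described above, here is a minimal sketch using Terraform's Kubernetes provider, consistent with the infrastructure-as-code approach shown earlier. The deployment name, image, replica count, kubeconfig path, and resource values are hypothetical placeholders, not recommendations.

    # Assumes a local kubeconfig; in a pipeline this would point at the target cluster.
    provider "kubernetes" {
      config_path = "~/.kube/config"
    }

    resource "kubernetes_deployment" "web" {
      metadata {
        name = "web"
      }

      spec {
        replicas = 3

        selector {
          match_labels = {
            app = "web"
          }
        }

        template {
          metadata {
            labels = {
              app = "web"
            }
          }

          spec {
            container {
              name  = "web"
              image = "nginx:1.21" # illustrative image

              resources {
                # Guaranteed resources: the scheduler reserves these amounts, so
                # setting them too high strands capacity and lowers node utilization.
                requests = {
                  cpu    = "250m"
                  memory = "256Mi"
                }

                # Hard ceilings: a low CPU limit throttles the container, and
                # exceeding the memory limit gets the container killed.
                limits = {
                  cpu    = "500m"
                  memory = "512Mi"
                }
              }
            }
          }
        }
      }
    }

Because these values live in the manifest, an optimization process can update them there and let the deployment pipeline roll out the change: the "optimization as code" pattern applied at the container level, paired with the node-level scaling group optimization sketched earlier.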

Taking Action

Although simply knowing that optimization is required, and quantifying the costs or risks that exist, can be a useful end in itself, the true goal is typically to take action to actually improve the running environment. But this can be challenging, and even organizations that recognize the need for this type of optimization can fail to make a difference if they don't take the right approach to actioning the recommendations. Application owners and lines of business are understandably concerned with the stability of their applications, and will not allow changes to their environments without an air-tight justification and a significant amount of supporting detail.

In order to ensure that the actions generated by a CapOps system meet this high bar, and can actually be taken, there are a number of key requirements that must be met:

1. Precision: Any recommendation that is generated must be accurate, and account for the minute details that impact the applications. For example, if an app requires local storage, then any recommendation to move to an instance type that doesn't have local storage is useless. If an app is 32-bit, then any recommendation to move from an M3 to an M5 is a non-starter. And, more commonly, if app components have specific resource requirements dictated by the vendor, such as an SAP module that must be configured with a specific amount of memory, then any recommendation to downsize these instances is counterproductive. A CapOps system must have detailed policies to account for this level of detail, and must also use benchmarks to model the impact of changing instance types, in order to provide sufficient precision to enable action. Without this, an organization will not succeed in promoting change, and trying to action flawed recommendations will have the perverse effect of creating more work for subject matter experts, who will need to review and vet each recommendation. The last thing you want is a database full of recommendations you can't take.

2. Integration: The recommendations generated by a CapOps system need to go somewhere, and in environments with highly distributed stakeholders, they would ideally go to systems those users already use, rather than making them log into something new. This includes reporting and business intelligence (BI) systems, change management systems, and even DevOps tooling and pipelines, where automation can occur.

To support this, recommendations need to exist in both machine-readable and human-readable form, enabling socialization as well as automation. For example, it is very important to have impact analysis reports that provide details of a recommended change (including the predicted utilization impact). These can be attached to change tickets, or distributed through messaging systems, and have a tremendous impact on the willingness to approve those changes. Similarly, business group rankings and "shameback" reports are also useful in promoting action, by providing transparency across the business.

3. Automation: Although success can be had without going to full automation, it is typically the long-term goal for many organizations, particularly as they achieve scale and move to containers. As the number of "moving parts" that must be optimized increases, manual action becomes less and less viable, and the risk of human error becomes higher and higher. But any automation strategy must also adhere to change management requirements, and the ideal solution is one that provides transparency (e.g., app owner reports), change control (e.g., ITSM integration), and full automation of the change when approval is attained. With sufficient trust in the analytics (consistent with the "precision" requirement), some organizations have negotiated with app teams to remove the approval requirement, which greatly streamlines the automation process.

In addition to these three requirements, a CapOps system must also be open, allowing access to the data and recommendations in order to feed other tools in the ecosystem. Because collecting data in cloud and container environments can be a challenge, a CapOps system that contains all of this data can be valuable on that basis alone. But combining this raw data with optimization analysis results and associated metadata is even more powerful, and a "Resource Management DB" (or "CapOpsDB") that contains all of this data could well become a key component in future tooling architectures.

The rise of tools like Grafana is evidence of the need for this kind of component, and expanding the data available to these tools to include detailed optimization results and predictive analysis models would be a logical progression for most organizations. It is also consistent with the move toward observability, and by combining the CapOps data with logs, tracing, performance analysis, and other "data lakes," the combination can become greater than the sum of its parts. Regardless of whether an organization takes a "centrally-managed" or a "centrally-coordinated" approach to capacity, it always makes sense to start with a "centrally-analyzed" set of answers.

Conclusion

Using history as our guide, and following the evolution of DevOps and FinOps, it is logical that CapOps, or something like it, will emerge to pick up the capacity torch that was temporarily dropped in the move to cloud. Reading the bill, and purchasing commitment-based discounts, will only take an organization so far; the next step is to optimize the actual resources that are purchased. And focusing on the elasticity inherent in cloud and container environments, and making sure the cloud-native constructs are working like a well-oiled machine, will make an organization much more responsive to changing business needs, less apt to experience operational issues, and far less likely to experience high cloud bills that are not reflective of its true needs.
