2009 29th IEEE International Conference on Distributed Computing Systems

A Reinforcement Learning Approach to Online Web Systems Auto-configuration

Xiangping Bu, Jia Rao, Cheng-Zhong Xu
Department of Electrical & Computer Engineering
Wayne State University, Detroit, Michigan 48202
{xpbu,jrao,czxu}@wayne.edu

Abstract

In a web system, configuration is crucial to performance and service availability. It is a challenge, not only because of the dynamics of Internet traffic, but also because of the dynamic virtual machine environment the system tends to be run on. In this paper, we propose a reinforcement learning approach for autonomic configuration and reconfiguration of multi-tier web systems. It is able to adapt performance parameter settings not only to changes of workload, but also to changes of the virtual machine configuration. The RL approach is enhanced with an efficient initialization policy to reduce the learning time for online decisions. The approach is evaluated using the TPC-W benchmark on a three-tier website hosted in a Xen-based virtual machine environment. Experiment results demonstrate that the approach can auto-configure the web system dynamically in response to changes in both workload and VM resources. It can drive the system into a near-optimal configuration setting in less than 25 trial-and-error iterations.

1. Introduction

Web systems like Apache and Tomcat applications often contain a large number of parameters; their settings are crucial to system performance and service availability. Manual configuration based on an operator's experience is a non-trivial and error-prone task. Recent studies revealed that more than 50% of the root causes of Internet service outages were system misconfigurations caused by operator mistakes [6].

The configuration challenge is due to a number of reasons. First is the increasing system scale and complexity, which introduce more and more configurable parameters, to a level beyond the capacity of an average-skilled operator. For example, both the Apache server and the Tomcat server have more than a hundred configurable parameters to set for different running environments. In a multi-component system, the interaction between the components makes performance tuning of the parameters even harder. Performance optimization of an individual component does not necessarily lead to overall system performance improvement [2]. Therefore, to find an appropriate configuration, the operator must develop adequate knowledge about the system, get familiar with each of the parameters, and run numerous trial-and-error tests.

Another challenge in configuration comes from the dynamic trait of web systems. On the Internet, the systems should be able to accommodate a wide variety of service demands and frequent component changes in both software and hardware. Chung et al. [2] showed that in a web system no single universal configuration is good for all workloads. Zheng et al. [20] demonstrated that in a cluster-based Internet service, the system configuration should be modified to adjust to cluster node updates.

Moreover, virtual machine technology and the related utility and cloud computing models pose new challenges in web system configuration. VM technology enables multiple virtualized logical machines to share hardware resources on the same physical machine. This technology facilitates on-demand hardware resource reallocation [12] and service migration [3]. Next-generation enterprise data centers will be designed in a way that all hardware resources are pooled into a common shared infrastructure; applications share these remote resources on demand [11], [17]. It is desirable that the resources allocated to each VM be adjusted dynamically for the provisioning of QoS guarantees while maximizing resource utilization [8]. This dynamic resource allocation requirement adds one more dimension of challenge to the configuration of web systems hosted in virtual machines. In particular, the configuration needs to be carried out online and automatically.

There were many past studies devoted to autonomic configuration of web systems; see [18], [19], [2], [20] for examples. Most of them focused on performance parameter tuning for dynamic workloads in static environments. Their optimization approaches are hardly applicable to online setting of the parameters in VM-based dynamic platforms due to their high time complexity. There were a few control approaches targeted at online tuning in response to changing workload [5]. They were largely limited to tuning of the single MaxClients parameter because of the inherent complexity of the control.

In this paper, we propose a reinforcement learning approach, namely RAC, for automatic configuration of multi-tier web systems in VM-based dynamic environments. Reinforcement learning is a process of learning from interactions.

1063-6927/09 $25.00 © 2009 IEEE    DOI 10.1109/ICDCS.2009.76

For a web system, its possible configurations form a state space. We define actions as reconfigurations of the parameters. Reinforcement learning is intended to determine appropriate actions at each state to maximize the long-term reward. Recent studies showed the feasibility of RL approaches in resource allocation [14], [16], [9], power management [15], job scheduling in grids [1], and self-optimizing memory controllers [4]. To the best of our knowledge, the RAC approach is the first application of the RL principle to automatic configuration of web systems.

The RAC approach has the following features: (1) It is applicable to multi-tier web systems where each tier contains more than one key parameter to configure; (2) It is able to adapt the system configuration to the change of workload in VM-based dynamic environments where the resources allocated to the system may change over time; (3) It is able to support online auto-configuration.

Online configuration has a time efficiency requirement, which renders conventional RL approaches impractical. To reduce the initial learning overhead, we equip the RL algorithm with efficient heuristic initialization policies. We developed a prototype configuration management agent based on the RAC approach. The agent is non-intrusive in the sense that it requires no change on either the server or the client side. All the information needed is application-level performance such as throughput and response time. We experimented with the RAC agent on a three-tier TPC-W website/benchmark in a Xen-based virtual machine environment. Experiment results showed that the RAC agent can auto-configure the web system dynamically in response to changes in both workload and VM resources. It can drive the system into a near-optimal configuration setting in less than 25 trial-and-error iterations.

The rest of this paper is organized as follows. Section 2 presents scenarios to show the challenges in configuration management in dynamic environments. Section 3 presents the basic idea of the RL approach and its application in auto-configuration. Enhancement of the approach with policy initialization is given in Section 4. Section 5 gives the experimental results. Related work is discussed in Section 6. Section 7 concludes the paper with remarks on limitations of the approach and possible future work.

2. Challenges in Website Configuration

2.1. Match Configuration to Workload

Application-level performance of a web system heavily depends on the characteristics of the incoming workload. Different types of workloads may require different amounts and different types of resources. The application configuration must match the needs of the current workload to achieve good performance.

For instance, MaxClients is one of the key performance parameters in Apache; it sets the maximum number of requests to be served simultaneously. Setting it to too small a number would lead to low resource utilization; in contrast, a high value may drive the system into an overloaded state. With limited resources, how to set this parameter should be determined by the requests' resource consumption and their arrival rates. Configurations of this parameter tuned for resource-intensive workloads may lead to poor performance under lightly loaded conditions.

To investigate the effect of configuration on performance, we conducted experiments on a three-tier Apache/Tomcat/MySQL website. Recall that Apache and Tomcat each have more than a hundred configuration parameters. Based on recent reports of industry practices and our own test results, we selected the eight most performance-relevant run-time configurable parameters from the different tiers, as shown in Table 1. For simplicity in testing, we assumed the default settings for the MySQL parameters.

Table 1. Tunable performance-critical parameters.

    Tier      Parameter            Range        Default
    Apache    MaxClients           [50, 600]    150
    Apache    KeepAlive timeout    [1, 21]      15
    Apache    MinSpareServers      [5, 85]      5
    Apache    MaxSpareServers      [15, 95]     15
    Tomcat    MaxThreads           [50, 600]    200
    Tomcat    Session timeout      [1, 35]      30
    Tomcat    minSpareThreads      [5, 85]      5
    Tomcat    maxSpareThreads      [15, 95]     50

We tested the performance using the TPC-W benchmark. The TPC-W benchmark defines three types of workload: ordering, shopping, and browsing, representing three different traffic mixes. It is expected that each workload has its preferred configuration, under which the system would yield the lowest average response time. Figure 1 shows the system performance for the different workloads under the three best configurations (out of our test cases). From the figure, we observe that there is no single configuration suitable for all kinds of workloads. In particular, the best configuration for shopping or browsing would yield extremely poor performance under the ordering workload.

2.2. Match Configuration to Dynamic VM Environments

For a web system hosted on VMs, its capacity is capped by the VM resources. The capacity tends to change with reconfiguration of the VM (for fault tolerance, service migration, and other purposes). A change of the VM configuration renders the previous web system configuration obsolete and hence calls for reconfiguration online. Such reconfigurations are error-prone and sometimes even counter-intuitive.
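One way to see how such counter-intuitive effects can arise is a toy response-time model (entirely our own illustration, not the paper's measurements): more concurrent workers reduce queueing delay but inflate per-request processing time through contention, so response time is U-shaped in MaxClients, and the best setting moves with load and hardware.

```python
# Toy model (our own assumption, not the paper's): effective service time
# grows with the concurrency level, while queueing delay shrinks with it.
def response_time(max_clients, arrival_rate=100.0, base_service=0.05,
                  contention=0.003):
    # Per-request service time inflated by concurrency contention.
    service = base_service * (1.0 + contention * max_clients)
    # Crude queueing term: outstanding work spread across the workers.
    queueing = arrival_rate * service / max_clients
    return queueing + service

candidates = range(50, 601, 50)   # the Table 1 range for MaxClients
best = min(candidates, key=response_time)
print(best, round(response_time(best), 4))
```

Under these made-up constants the optimum sits at an intermediate value; changing the arrival rate or the contention factor moves it, which mirrors the paper's point that no single MaxClients setting suits every workload or platform.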

Figure 1. Performance under configurations tuned for different workloads.

In the following, we still use the MaxClients parameter to show the challenges due to VM resource changes. In this experiment, we kept a constant workload and dynamically changed the VM resources allocated to the application and database servers. We defined three levels of resource provisioning: Level-1 (4 virtual CPUs and 4GB memory), Level-2 (3 virtual CPUs and 3GB memory), and Level-3 (2 virtual CPUs and 2GB memory). Figure 2 shows the impact of different MaxClients settings under the different VM configurations. From the figure, we can see that each platform has its own preferred MaxClients setting leading to the minimum response time. We notice that as the capacity of the machine increases, the optimal value of MaxClients actually goes down instead of going up as we initially expected. The main reason for this counter-intuitive finding is that as the VM becomes more and more powerful, it can complete a request in a shorter time. As a result, the number of concurrent requests will decrease and there is no need for a large MaxClients value. Moreover, the measured response time includes a request's queuing time and its processing time. The MaxClients parameter controls the balance between these two factors. A large value would reduce the queueing time, but at the cost of processing time because of the increased level of concurrency. The tradeoff between queuing time and processing time is heavily dependent on the concurrent workload and the hardware resources.

Figure 2. Effect of MaxClients on performance.

MaxClients aside, we tested the settings of the other parameters under different VM configurations. Their effects are sometimes counter-intuitive due to the dynamic features of web systems. Figure 3 shows that no single configuration is best for all platforms. In particular, the configuration tuned for the Level-2 platform may even deliver better performance on the Level-1 platform.

Figure 3. Performance under configurations tuned for different VMs.

3. Reinforcement Learning Approach to Auto-configuration

In this section, we present an overview of our RL approach and its application to auto-configuration.

3.1. Parameter Selection and Auto-configuration

Today's web systems often contain a large number of configurable parameters, and not all of them are performance relevant. For tractability of auto-configuration, we first select the most performance-critical parameters as configuration candidates. Online reconfiguration is intended to improve performance, but it comes at the cost of run-time overhead. Including a huge number of parameters will sharply increase the online search space, causing a long delay to converge or making the system unstable. To select an appropriate tuning parameter, we have to deal with the tradeoff between how much the parameter affects the performance and how much overhead it causes during the online search.

Even from the performance perspective, how to select the appropriate parameters for configuration is a challenge. In [20], the authors used a parameter dependency graph to find the performance-relevant parameters and the relationships among them. Our focus is on autonomic reconfiguration in response to system variations by adjusting a selective group of parameters. Table 1 lists the parameters we selected and the ranges of their values for testing purposes. How to automatically select the relevant parameters is beyond the scope of this paper.

For a selective group of parameters in different tiers, we design an RL-based autonomic configuration agent for multi-tier web systems.
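To make the cost of including too many parameters concrete, the discretized configuration space can be counted directly. The ranges below come from Table 1; the step sizes are our own illustrative assumption, not values from the paper.

```python
from math import prod

# Ranges from Table 1; step sizes are illustrative assumptions.
params = {
    "MaxClients":        (50, 600, 50),   # (low, high, step)
    "KeepAlive_timeout": (1, 21, 5),
    "MinSpareServers":   (5, 85, 20),
    "MaxSpareServers":   (15, 95, 20),
    "MaxThreads":        (50, 600, 50),
    "Session_timeout":   (1, 35, 5),
    "minSpareThreads":   (5, 85, 20),
    "maxSpareThreads":   (15, 95, 20),
}

def num_values(low, high, step):
    """Count of discrete settings in [low, high] at the given step."""
    return (high - low) // step + 1

sizes = {name: num_values(*r) for name, r in params.items()}
state_space = prod(sizes.values())
print(sizes)
print(f"discretized configurations: {state_space}")
```

Even at this coarse discretization the space has millions of states, which is why restricting tuning to a small selective group of parameters, and compressing it further by parameter grouping (Section 4), matters.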

The agent consists of three key components: a performance monitor, a decision maker, and a configuration controller. The performance monitor passively measures the web system performance at a predefined time interval (we set it to 5 minutes in the experiments) and sends the information to the RL-based decision maker. The only information the decision maker needs is application-level performance such as response time or throughput; it requires no OS-level or hardware-level information, for portability. The decision maker runs an RL algorithm and produces a state-action table, called the Q-value table. A state is defined as a configuration of the selected parameters. Possible actions include increasing or decreasing parameter values, or keeping them unchanged; see the next section for details. Based on the dynamically updated Q-value table, the configuration controller generates the configuration policy and reconfigures the whole system if necessary.

3.2. RL-based Decision Making

Reinforcement learning is a process of learning through interactions with an external environment (or the web system in this paper). The reconfiguration process is typically formulated as a finite Markov decision process (MDP), which consists of a set of states and several actions for each state. During each state transition, the learning agent receives a reward defined by a reward function

    R = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ].

The goal of the agent is to develop a policy π : S → A that maximizes the collected cumulative rewards through iterative trial-and-error interactions [13].

We first cast the online automatic configuration problem as an MDP by defining the state space S, the action set A, and the immediate reward function r(s, a).

State Space. For the online auto-configuration task, we define a state as a possible system configuration. For the selected group of n parameters, we represent a state by a vector of the form

    (Para_1, Para_2, ..., Para_n).

Action Set. We define three basic actions: increase, decrease, and keep, associated with each parameter. We use a vector a_i to represent an action on parameter i. Each element itself is a 3-element vector, indicating taken/not taken (1/0) of the three actions. For example, the following notation represents an increase action on parameter i:

    a_i^increase = ( ..., Para_i(1, 0, 0), ..., Para_n(0, 0, 0) ).

Immediate Reward. The immediate reward should correctly reflect the system performance. The immediate reward r at time interval t is defined as

    r_t = SLA - perf_t,

where SLA is a reference time predefined in the Service Level Agreement and perf_t is the measured response time. For a given SLA, a lower response time returns a positive reward to the agent; otherwise the agent receives a negative penalty.

Q-value Learning. The temporal-difference (TD) method is most suitable for our work due to two advantages: it needs no model of the environment, and it updates Q-values at each time step based on its estimation. In this incremental fashion, the average Q-value of an action a on state s, denoted by Q(s, a), can be refined once each immediate reward r is collected:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ],

where α is a learning-rate parameter that facilitates convergence to the true Q-values in the presence of noisy or stochastic rewards and state transitions [13], and γ is the discount rate that guarantees convergence of the accumulated reward in a continuing task. Algorithm 1 presents the pseudo-code of our Q-value learning algorithm.

Algorithm 1 Q-value Learning Algorithm.
 1: Initialize Q table
 2: Initialize state s_t
 3: error = 0
 4: repeat
 5:   for each state s do
 6:     a_t = get_action(s_t) using ε-greedy policy
 7:     for (step = 1; step <= LIMIT; step++) do
 8:       Take action a_t; observe r and s_{t+1}
 9:       Q_t = Q_t + α (r + γ Q_{t+1} - Q_t)
10:       error = MAX(error, |Q_t - Q_t^prev|)
11:       s_t = s_{t+1}; a_{t+1} = get_action(s_t); a_t = a_{t+1}
12:     end for
13:   end for
14: until error < θ

4. Online Learning and Adaptation

An RL algorithm explores system dynamics by interacting with the external environment. A practical problem with the basic algorithm is that the number of Q-values that need to be explored increases exponentially with the number of attributes used in the state representation [4]. The initial poor performance and long convergence time make such online learning impractical by itself.

4.1. Policy Initialization

The initial poor performance and poor scalability would limit the potential of RL algorithms for online auto-configuration. As a remedy, our RL agent uses an external policy initialization strategy to accelerate the learning process. Briefly, it first samples the performance of a small portion of typical configurations and uses these sample data to predict the performance of other, similar configurations.
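The sampling-and-prediction step just described can be sketched concretely. The sample points and the exact-quadratic fit below are our own illustration; the paper only specifies polynomial regression over coarse-grained samples.

```python
# Sketch of the prediction step: sample the response time at a few coarse
# MaxClients settings, fit a quadratic (the concave-upward shape noted in
# the text), and predict unvisited settings. Sample numbers are made up.
coarse_samples = [(50, 950.0), (300, 600.0), (600, 1100.0)]

def fit_quadratic(points):
    """Solve a*x^2 + b*x + c through three points (Cramer's rule)."""
    (x1, y1), (x2, y2), (x3, y3) = points
    det = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / det
    b = (x3**2 * (y1 - y2) + x2**2 * (y3 - y1) + x1**2 * (y2 - y3)) / det
    c = (x2 * x3 * (x2 - x3) * y1 + x3 * x1 * (x3 - x1) * y2
         + x1 * x2 * (x1 - x2) * y3) / det
    return a, b, c

a, b, c = fit_quadratic(coarse_samples)
predict = lambda x: a * x * x + b * x + c

# Predict the performance of fine-grained, unvisited configurations.
for mc in (100, 200, 400, 500):
    print(mc, round(predict(mc), 1))
```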

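The initial policy itself is learned by running the TD update of Algorithm 1 over the measured and predicted data. A minimal sketch of that update follows; α and γ match the values quoted in Section 4, while the state, the action, the SLA constant, and the response times are hypothetical stand-ins of our own.

```python
import random

# Minimal sketch of the temporal-difference update from Algorithm 1.
# States, actions, and the SLA value are illustrative stand-ins, not the
# paper's actual configuration space.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.05
SLA = 1000.0  # reference response time in ms (made-up value)

def reward(measured_response_ms):
    # r_t = SLA - perf_t: below-SLA response times earn positive reward.
    return SLA - measured_response_ms

def get_action(q_table, state, actions):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

def td_update(q_table, s, a, r, s_next, a_next):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
    q = q_table.get((s, a), 0.0)
    q_next = q_table.get((s_next, a_next), 0.0)
    q_table[(s, a)] = q + ALPHA * (r + GAMMA * q_next - q)

# One hypothetical step: increasing MaxClients improved the response time.
q = {}
td_update(q, s=("MaxClients", 150), a="increase",
          r=reward(800.0), s_next=("MaxClients", 200), a_next="keep")
print(q)
```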
Based on this information, the agent runs another reinforcement learning process to generate an initial policy for the online learning procedure.

First, to learn the initial policy, we need to collect training data for the subsequent RL learning. It is not practical to collect the performance of all configurations because it would take too long. A key issue is to choose representative states for approximation. In our implementation, we use a technique named parameter grouping to group parameters with similar characteristics together so as to reduce the state space. For example, both MaxClients and MaxThreads are limited by the system capacity, and both KeepAlive timeout and session timeout are limited by the number of multi-connection transactions. The first two parameters therefore form one group, and the other two form another group. The parameters in the same group are always given the same value. Moreover, coarse granularity is used for each group during training data collection, instead of the fine granularity used in online learning.

After obtaining the state values of the representative configurations, we use a simple but efficient method to predict the performance of the other configurations. It is based on the fact that all parameters have a concave upward effect on the performance, as revealed in [5]. Figure 4 shows the concave upward effect of the single parameter MaxClients on response time, observed in one of our experiments. Using a polynomial regression algorithm, we formulate the performance as a function of the configurable parameters and predict the performance of the states absent from the data collection step.

Figure 4. Concave upward effect of MaxClients and regression.

After getting all the training data, we run an offline reinforcement learning process, shown in Algorithm 1, to generate an initial Q-value table for online learning. In our implementation, we set α = 0.1, γ = 0.9, ε = 0.1 for the offline training. Algorithm 2 gives the pseudo-code of the policy initialization algorithm.

Algorithm 2 Policy Initialization.
1: Parameter grouping
2: for each group do
3:   Collect data at coarse granularity
4: end for
5: Generate regression-based prediction function
6: Predict performance of unvisited configurations
7: Run the RL process to learn an initial policy

4.2. Online Learning

Although policy initialization avoids the initial poor performance, it may not be accurate enough for reconfiguration decisions. In the subsequent online learning, our agent keeps measuring the current system performance and retrains the Q-value table at each time interval using Algorithm 1 with α = 0.1, γ = 0.9, ε = 0.05. In each retraining procedure, the agent updates the performance information for the current configuration but keeps the old information for the other configurations. Based on the updated performance information, it updates the Q-value table using batch training so as to guarantee that most of the states are aware of the new changes in the system. After each retraining, the agent directs the system to the next state based on the new policy derived from the updated Q-value table.

4.3. Adaptation to Workload and System Dynamics

Recall that for a web system hosted in a VM-based environment, there are two dynamic factors: the changing incoming workload and VM resource variation. As we discussed in Section 2, there is no single best configuration for all types of workloads and VM resource profiles. We call the combinations of traffic mixes and VM resource settings system contexts. To address the problem of poor initial performance and accelerate the learning process, we construct different initialization policies for different scenarios through offline training. Algorithm 3 shows the online training and adaptation algorithm. The RL agent continuously collects the immediate reward in each configuration and compares it with the average of the last n values. An abrupt change in the reward value is considered a violation. If violations are detected in several consecutive iterations, the agent believes that there is a context change and switches to a corresponding initial policy. Violations are detected based on a violation threshold v_thr. Once violations have happened s_thr times in a row, the agent switches to the most suitable initial policy according to the current performance. The threshold s_thr controls the trade-off between the agent's adaptability and stability. Setting it to too small a value will make the agent too sensitive to system fluctuations, but too large a value will harm the agent's adaptability.
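The context-change test can be sketched as follows, using the thresholds quoted in the text (n = 10, s_thr = 5, v_thr = 0.3). The response-time trace is made up for illustration.

```python
from collections import deque

# Sketch of the context-change detector: a sample is a "violation" when it
# deviates from the average of the last n samples by more than v_thr;
# s_thr consecutive violations trigger a policy switch.
N, S_THR, V_THR = 10, 5, 0.3

class ContextDetector:
    def __init__(self):
        self.history = deque(maxlen=N)
        self.consecutive = 0

    def observe(self, rptime_cur):
        """Return True when a context switch should be triggered."""
        if self.history:
            avg = sum(self.history) / len(self.history)
            pvar = abs(rptime_cur - avg) / avg
            self.consecutive = self.consecutive + 1 if pvar > V_THR else 0
        self.history.append(rptime_cur)
        return self.consecutive >= S_THR

detector = ContextDetector()
trace = [500] * 10 + [900] * 6   # workload shift raises the response time
switched_at = [i for i, t in enumerate(trace) if detector.observe(t)]
print(switched_at)
```

Note how the detector resets its violation counter when the sliding average catches up with the new level, which is the stability behavior the s_thr threshold is meant to control.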

The effect of s_thr will be discussed in Section 5.2. Empirically, we set n, s_thr, and v_thr to 10, 5, and 0.3, respectively. A violation is detected from the relative deviation of the current response time from the recent average:

    pvar = |rptime_cur - rptime_aver| / rptime_aver,
    violation = 0 if pvar <= v_thr; 1 otherwise.

Algorithm 3 Online Learning.
 1: Input initialized Q-value table
 2: Input initialized state s_t
 3: for each configuration iteration do
 4:   Issue reconfiguration action based on current Q-value table
 5:   Measure current performance
 6:   Check context variations
 7:   If number of consecutive violations exceeds s_thr
 8:   Then switch policy
 9:   Update Q-value table using Algorithm 1
10:   Enter the next step
11: end for

Table 2. Examples of contexts with different workloads and VM resources. (Contexts 1 through 6 pair TPC-W traffic mixes with VM resource levels 1, 1, 3, 2, 2, and 1, respectively.)

5. Experimental Results

In this section, we evaluate the effectiveness of the RL-based auto-configuration agent on a multi-tier web system running the TPC-W benchmark. The application-level performance is measured in terms of response time.

5.1. Experimental Setup

To evaluate the effectiveness of the RAC approach, we deployed a multi-tier website in a VM-based environment. The physical machine was configured with two Intel quad-core Xeon CPUs and 8GB memory. A client machine with the same hardware configuration was used to emulate concurrent customers. All experiments were conducted in the same local network.

The physical machine hosting the multi-tier website ran the Xen virtual machine monitor (VMM), version 3.1. Xen is a high-performance resource-managed VMM consisting of two components: a hypervisor and a driver domain. The hypervisor provides the guest OS with the illusion of occupying the actual hardware devices. The driver domain is in charge of managing the other guest VMs and executes resource allocation policies. In our experiments, both the driver domain and the VMs ran CentOS 5.0 with Linux kernel 2.6.18. The multi-tier website was deployed on two VMs, with the Apache web server in the first one, and the Tomcat application server and MySQL database server in the other. The RAC agent resided in the driver domain.

We evaluated the RAC approach using the TPC-W benchmark [10]. The TPC-W benchmark defines three workload mixes: ordering, shopping, and browsing, which have different combinations of browsing and ordering requests. To evaluate the RAC approach's adaptation to system dynamics, we changed the client traffic as well as the resources allocated to the VMs hosting the website. Considering that the application server and the database server are the bottleneck of our system, only the resources allocated to the VM hosting the last two tiers were changed. Table 2 lists the combinations of client traffic and VM resources.

5.2. Performance of Configuration Policies

In this section, we study the effectiveness of the RAC approach for online auto-configuration. We compared the RAC agent with two other configuration approaches. The first uses the static default parameter settings listed in Table 1. The other is a trial-and-error method based on tuning individual parameters. This approach assumes that all parameters have a concave upward effect on system performance. More specifically, the trial-and-error method tunes the system starting from an arbitrary parameter while fixing the remaining parameters. The parameter setting that produces the best performance is selected as the optimal value for this parameter. Then the agent goes on to the next parameter. Once all the parameters are processed, the resulting parameter settings are considered the best configuration for the system. This approach mimics the way an administrator may tune the system manually.

In this experiment, we dynamically changed the system contexts to evaluate the adaptability of the RAC agent. The system stayed in one context for 30 iterations before switching to a new one. Figure 5 plots the online performance of the RAC agent compared to the other two methods in three consecutive system contexts: context-1 (iterations 0 to 30), context-2 (iterations 31 to 60), and context-3 (iterations 61 to 90).

From Figure 5, we can see that the RAC agent performed best among the three approaches. It was always able to drive the system to a stable state in less than 25 interactions with the external environment. Its overall performance was around 30% better than the trial-and-error agent and 60% better than the static default configuration. Furthermore, the RAC agent was able to adapt to context changes in a timely way. After the client traffic changed at the 30th iteration, the RAC agent continuously observed performance violations and switched its policy at iteration 35.

Figure 5. Performance due to different auto-configuration policies.

The response time dropped more than 60% after the RAC agent got the new initial policy. The new policy could lead the system to a stable configuration within 15 iterations. More importantly, the RAC agent consistently improved the performance during the process of parameter reconfiguration. It optimized the cumulative reward during the reconfiguration steps, avoiding severe performance degradation.

As we expected, the static default configuration yielded the worst performance in most of the test cases. Because there was no adaptation to the system context variations, the static configuration was not suitable for a dynamic environment. For most of the time, the trial-and-error agent produced a much better performance, and it was able to drive the system to a stable state. However, because this approach was based on tuning individual parameters independently, the agent was prone to being trapped in locally optimal settings. Figure 5 shows that the performance of the s

We also study whether it is necessary to refine the learned policies through online learning. Figure 6 compares the agent's performance with and without online learning. The agent without online learning drove the system to a stable configuration in 12 fewer iterations than the one using online learning. However, the policy refined by online learning achieved better stable performance: there is approximately a 50% performance improvement from the online refinement over the offline-trained policy. The slower convergence to a stable state and the fluctuations at the beginning of online learning are attributable to the process of online interactions, which involves a certain amount of exploration actions. Explorations are often sub-optimal actions that probe environment dynamics. Although online learning suffered a longer convergence time and initial fluctuation, it was able to find a better configuration than the offline policy.
