Load and Network Aware Query Routing for Information IntegrationWen-Syan Li Vishal S. Batra Vijayshankar Raman Wei HanK. Selçuk Candan Inderpal NarangIBM Almaden Research Center650 Harry Road, San Jose, CA 95120 USAEmail: [email protected] federated systems deploy cost-based query optimization mechanisms; i.e., the optimizer selects a globalquery plan with the lowest cost to execute. Thus, cost functions influence what remote sources (i.e. equivalent datasources) to access and how federated queries are processed.In most federated systems, the underlying cost model isbased on database statistics and query statements; however, the system load of remote sources and the dynamicnature of the network latency in wide area networks arenot considered. As a result, federated query processingsolutions can not adapt to runtime environment changes,such as network congestion or heavy workloads at remotesources. We present a novel system architecture that deploysa Query Cost Calibrator to calibrate the cost function basedon system load and network latency at the remote sourcesand consequently indirectly ”influences” query routing andload distribution in federated information systems.have two phases, a compile time phase and a runtime phase:Compile time phase:1. A user query submitted to the II is intercepted by theQuery Patroller. The user query statement and querysubmission time are recorded. The query is then forwarded to II for further processing.2. II looks up the nickname (i.e. local names of theremote database tables) definitions in the user queryand breaks (i.e. rewrites) the query into multiple subqueries and forwards these sub-queries to the appropriate wrappers according to their types.3. For those sub-queries to be forwarded to a relationalwrapper, the wrapper will return the query fragmentsthat can be executed at each remote server and their estimated costs. For those sub-queries that are forwardedto a file wrapper, file paths are returned to II withoutestimated cost. For the cost estimation purpose, thewrapper may contact remote servers to obtain possiblesupported execution plans and their estimated costs.1 IntroductionA wide variety of applications require access to multiple heterogeneous, distributed data sources. By transparently integrating such diverse data sources, underlying differences in DBMSs, languages, and data models can be hidden and users can use a single data model and a single highlevel query language to access the unified data through aglobal schema.To address the needs of such federated informationsystems, IBM has developed the DB2 Information Integrator (II) [10] which provides relational access to bothrelational DBMSs and non-relational sources, such asfile systems and web services. These data sources areregistered at II as nicknames and thereafter can be accessedvia wrappers. Statistics about the remote data sourcesare collected and maintained at II for later use by the IIquery optimizer for costing query plans. Figure 1 shows anarchitectural overview of the federated information systemdesign using DB2 II. The operational flows on the systemRuntime phase:Proceedings of the 21st International Conference on Data Engineering (ICDE 2005)1084-4627/05 20.00 2005 IEEE1. After II receives all query fragments that can be executed at the remote sources and their estimated costs,the II query optimizer performs global query optimization. The query fragments selected by the query optimizer and their estimated costs as well as the estimatedexecution cost of the global query plan are stored in theexplain table. Other information stored includes theexecution descriptors of the selected query fragmentsthat are needed to executed the query fragments at theremote servers.2. II then forwards the query fragments selected by theglobal query plan to the wrapper for the remote sourcesfor execution.3. Query fragments are executed at the remote serversand results are returned to II through the wrappers. Theresults are merged by II and then sent back to the user.

Figure 1. Architectural Overview of FederatedInformation Systems4. After the query execution is completed, Query Patroller records the query completion time in the log forfuture use.1.1 Problem StatementDB2 Information Integrator (II) deploys cost-basedquery optimization to select a low cost global query planto execute. Thus, cost functions used by II heavily influence what remote servers to be accessed and how federatedqueries are processed. Cost estimation is usually based ondatabase statistics, query statements, and the local and remote system configuration (including the CPU power andI/O device characteristics such as seek time and transmission rates). In addition, DB2 allows the system administrator to specify expected network latencies between II and theremote servers. Cost functions have significant impacts onthe choice of remote servers (i.e. equivalent data sources);however, existing cost functions do not consider the loads on the remote servers; dynamic nature of network latency between remoteservers and II; and the availability characteristics of the remote sources.As a result, federated systems can not dynamically adapt toruntime environment changes, such as network congestionsor load spikes at the remote sources. Furthermore, since thequery plans are generated via cost-based decision makingprocess, currently, there are no mechanisms to avoid fastbut unreliable sources, when alternatives are available. Inaddition, the II query optimizer optimizes user queries individually rather treating a workload as a whole. In somescenarios, selecting a low cost global query plan and applying this plan to all similar queries is not necessarily idealwhen the workload needs to be distributed among alternative servers for better overall system performance via loadbalancing.In this paper, we introduce a novel system architecturethat deploys a query cost calibrator (QCC) to calibrate thecost function based on system availability, process and network latencies at the remote sources, and the system load atthe information integration (II). QCC transparently adaptsthe cost functions to the runtime properties of the environment, and indirectly influences the federated query optimizer. This enables the selection of the right remote sourcesor replicas (i.e equivalent data sources) which yield thefastest overall query response time for a workload. QCCalso identifies alternative query plans and recommends loadbalancing strategies to improve overall system performance.We have implemented the II middleware with QCC andevaluations show that it consistently outperforms a prototype version of DB2 Information Integrator.1.2 Paper OrganizationThe rest of paper is organized as follows. In Section 2,we describe the architectural design of the current DB2 Information Integrator featuring adaptive query routing andload balance functionalities. In Section 3, we introduce thenovel Query Cost Calibrator (QCC) component and its keyfeatures. In Section 4, we describe the load distributionscheme used in our system for load balance. In Section 5,we present the experimental evaluation of the proposed solution to demonstrate the benefits of the QCC in improvingresponse time. In Section 6, we discuss the related workand compare with the work we present here. In Section 7,we give our concluding remarks.2 Proposed System ArchitectureIn this section, we describe how our proposed system architecture (shown in Figure 2) builds on top of this architecture and enhances it transparently with two complementarycomponents: (1) a meta-wrapper (MW) and (2) a query costcalibrator (QCC).The meta-wrapper serves as a middleware between theinformation integrator and wrappers. At the compile time,MW receives queries from II and records (a) the incoming federated query statements, (b) the estimated cost of thefederated queries, (c) the outgoing query fragments, and (d)their mappings to the remote servers.During the run time, MW records (e) the response timeof each query fragment. This information is forwarded toQCC for further processing and analysis. Based on the estimated cost at the compile time and actual execution timemonitored at the runtime, QCC derives an up-to-date queryfragment processing cost calibration factor. Using this factor, QCC can dynamically calibrate the future cost estimatesProceedings of the 21st International Conference on Data Engineering (ICDE 2005)1084-4627/05 20.00 2005 IEEE

Figure 2. Proposed Federated Information Systems with Query Cost Calibrator (QCC)so that various system characteristics, such as remote system loads and network latencies, are implicitly taken intoconsideration while estimating query fragment processingcosts. In addition to such transparent statistics collection,QCC also uses daemon programs that periodically accessremote sources, through MW, to ensure their availability.The daemon programs are also used to derive initial querycost calibration factors by exploring the network latencyand processing latency at remote sources.When wrappers do not provide cost estimation and accessing remote servers to get cost statistics is not feasible, QCC deploys a simulated federated system that has thesame II, meta-wrapper, and wrappers as same as the originalrun time system as well as the simulated catalog and virtualtables, to capture database statistics and server characteristics without storing the actual data. The simulated federated system allows QCC to derive alternative query plansand perform ”what-if” analysis for query routing and dataplacement.After query compilation at II, only the global query planwith the lowest cost is stored in the explain table. Whenqueries are unique, this approach of choosing low cost plansis suitable. However, if there is a large number of similar queries that use the same plan, then the remote serversinvolved in this plan can get overloaded, rendering the original statistics invalid. To prevent such hot-spots and achieveproper load distribution, through the calibration and queryrouting of QCC, II is enabled to use alternative (maybe notthe lowest cost, but close) global query plans in addition tothe lowest-cost query plan.In addition to collecting and using cost statistics, QCCalso records error messages (if any) from accessing remoteservers for assessing their availability and reliability. Thisinformation is later used to compute the reliability factor forcost calibration. Consequently, QCC influences II to accessnot only high performance but also highly available remoteservers.3 Query Cost CalibratorIn this section, we describe the design and functions ofthe Query Cost Calibrator. The parameters associated withcost functions in II include first tuple cost, next tuple cost,and cardinality, and total cost (i.e. first tuple cost nexttuple cost cardinality). QCC calibrates first tuple cost,next tuple cost, and total cost. These costs and cardinalityare returned to II and are then converted to the parametersthat the optimizer uses internally for the cost functions. Thecost refereed in this paper is total cost for simplifying thepresentation.3.1 Query Cost Calibration for System Load andNetwork LatencyFigure 3 illustrates processing steps of a federated query,Q1, integrating data from two sources, S1 and S2 (underlined numbers in the figure indicate sequence numbers ofthe processing steps). II accesses the individual sourceswith query fragments and merges the results locally. Duringthe compile time, Q1 is transformed into two query fragments, QF 1 and QF 2, for S1 and S2 respectively, and bothquery fragments are forwarded to the meta-wrapper (MW).MW forwards QF 1 and QF 2 to the corresponding wrappers for cost estimation.The wrappers compute and return possible executionplans for these query fragments along with their estimatedcosts. In this example, two plans (QF 1 p1 and QF 1 p2)are returned for QF 1 and two plans (QF 2 p1 and QF 2 p2)Proceedings of the 21st International Conference on Data Engineering (ICDE 2005)1084-4627/05 20.00 2005 IEEE

are returned for QF 2. In addition to forwarding this information to II, MW passes these plans and the associated coststo QCCs (shown at the right hand side of Figure 3), whichrecord it for future use.The II query optimizer then selects a set of query fragments in the global query plan. Say, QF 1 p1 and QF 2 p2are selected. At the compile time phase, neither MW norQCC knows which query fragment plans are selected bythe II query optimizer. After the selection of the query execution plan by the II optimizer, the run time phase of queryprocessing starts (Figure 4). During the execution, the selected fragment plans, QF 1 p1 and QF 2 p2, are sent toMW, which then forwards the corresponding execution descriptors to appropriate remote servers through wrappers.After the execution of the query fragments are completed,once again, MW not only passes the results back to II, butalso keeps track of the response times of the individualquery fragments (in this example, 8 and 7 for QF 1 p1 andQF 2 p2 respectively) and passes this information to QCCwhich stores it for future use.At this point, QCC has the record of two sets of costsfor each query fragment (if selected by the II query optimizer); the estimated cost (obtained in the compile phase)and the observed cost (response time obtained in the runtime phase). Assuming that the original cost estimates arevalid, any significant difference between these two sets ofvalues has to be caused by variations in the network latencies or processing cost variations at the remote sources dueto their local loads.These external (and dynamic) factors are not explicitlyknown to II and are not included in the cost model. However, their combined effects can be captured using a singlequery fragment processing cost calibration factor per datasource (and query fragment if runtime statistics is available), defined as the ratio of the average runtime cost vs. theaverage estimated cost. This allows II query optimizer toconsider network and process latencies at the remote serverswithout having to observe these factors explicitly. Once thecalibration factors are calculated, they can be used for calibrating estimated costs for future, yet-unseen query fragments. For this purpose, QCC computes per source queryfragment processing cost calibration factors. In our simplified running example, the calibration factors for S1 and S2can be calculated as 1.6 (i.e. 8/5) and 1.4 (i.e. 7/5) respectively.Using these server cost calibration factors, II can calibrate other query fragments which have no runtime costrecords. This process is depicted in Figure 5. Here a newfederated query, Q5, is issued to II. II transforms the queryinto two query fragments, QF 1 (seen before) and QF 3 (notseen before). At the compile phase, for QF 3 MW consultsthe wrapper to get a query plan, QF 3 p1 and its estimatedcost, 8. However, instead of returning this estimated costdirectly, MW calibrates the cost to 11.2 by multiplying theestimated cost, 8, by the per server query fragment processing cost calibration factor, 1.4. On the other hand, since wealready have a plan and an estimated cost for QF 1, MWcan compute the calibrated runtime cost without having toconsult the wrapper as MW performs for QF 1 in this example.3.2 Cost Calibration for Information IntegrationII query optimizer uses a cost model, which includes thesize of the query results at the remote sources as well as theadditional cost of merging and aggregating the results locally, to estimate the cost of global query plans. In this process, II considers various database statistics and the physicalcharacteristics of the server where II is located. However,the standard II cost model does not consider the impact ofsystem load on II in local integration. To improve the quality of the estimates for the response times seen by end usersor applications, we further calibrate the cost observed at theII level, using a workload cost calibration factor. For thispurpose, we use the calibrated cost of the sources instead ofthe estimated cost, since the calibrated cost is closer to theruntime response time. In Figure 6, we illustrate this process with a simplified example in which we use executionhistory at II to calculate the work load cost calibration factor. Note that the table maintained in QCC for II query costcalibration factors is different from the table maintained forquery fragment processing cost calibration factors.3.3 Consideration of Remote Source AvailabilityQCC uses daemon programs to periodically access remote sources to verify their availability. In addition, fromthe query execution log provided by the query patrollerand MW, QCC can detect the system down event at remotesources. When a system down event is detected, QCC temporally adjusts cost functions for these unavailable serversto infinity so that no query fragments will be forwarded tothese remote sources. Note that the runtime log provided bythe query patroller and MW enables QCC to influence II notto route queries to the unavailable remote sources. On theother hand, the status reports from the daemon programs allow QCC to make unavailable remote sources be consideredby II again once the remote resources become available.3.4 Dynamic Adjustment of Calibration CycleSince each remote server has different network and processing latencies, the query fragment processing cost calibration factors are calculated per remote server. Furthermore, dynamic nature of the network and processing latencies at each remote server can vary dramatically. Thus, thefrequency of re-calibration does have impact to effectiveness of QCC in influencing II query optimization.Proceedings of the 21st International Conference on Data Engineering (ICDE 2005)1084-4627/05 20.00 2005 IEEE

Figure 3. Interaction between QCC and Meta-wrapper (Compile Time)Figure 4. Interaction between QCC and Meta-wrapper (Runtime)Figure 5. Interaction between QCC and Meta-wrapper with Calibrated Cost (Compile Time)Proceedings of the 21st International Conference on Data Engineering (ICDE 2005)1084-4627/05 20.00 2005 IEEE

Figure 6. Calibration for Information IntegrationQCC maintains aggregated histories of the various dynamic values associated with the remote source access coststo compute and maintain running averages. The informationenables QCC to dynamically adjust calibration cycles forthe reliability factor, query fragment processing cost calibration factor, II processing cost calibration factor, and simulated catalog refreshes.3.5 Adaptive Query RoutingAbove, we have described the novel functions of thequery cost calibrator. By monitoring the runtime environment and query execution status, QCC dynamically calibrates the cost functions of the II query optimizer and thusinfluences the selection of global query execution plans.The introduction of QCC to II query optimization allowsadaptive query routing based on current runtime environment instead of fixed query routing plans that are predetermined when the mapping between nicknames and remote sources and network latency values are defined. QCCcan detect unavailability of the remote sources and adjustthe cost functions of the remote sources so that II would notconsider routing queries to these unavailable sources. QCCalso incorporates reliability into the decision process of remote source selection.Note that QCC influences the II optimizer to select theglobal query execution plans that have the lowest cost according to current runtime environment instead of introducing a new component or requiring the modification of thecode at II optimizer. This transparent design gives QCCgreat flexibility in customizing cost functions for different business applications that may demand incorporationof unique business logic, such as QoS goal and reliability,outside of DB2 and II without modification of the DBMS.In the next section, we will describe more advanced queryrouting schemes for load distribution.4 Load DistributionThe query optimizer optimizes incoming queries individually, by selecting the cheapest global query plan, rathertreating the workload as a whole. Selecting a low costglobal query plan and applying this plan to all similarqueries is not necessarily ideal since it tends to overloada small group of servers. In some scenarios, the workloadneeds to be distributed in a balanced way among alternativeservers for better overall system performance. In this section, we describe a load balancing scheme to overcome thisobstacle.Figures 7 and 8 illustrate an example scenario which depicts how the proposed load distribution scheme works. Inthis scenario, there are four remote servers, S1, S2, R1, andR2. R1 is a replica of S1 and R2 is a replica of S2. A federated query, Q6, is submitted to II and the query requiresjoin operations across the two sources.Query routing for load balancing can be performed at thequery fragment processing or at the global query processinglevels. We describe these two alternatives in the followingsections.4.1 Load Balance at Query Fragment LevelIn this example, Q6 has two query fragments: QF 3 andQF 4. As shown in Figure 7, the calibrated costs of twoquery fragment plans, QF 3 p1 using S1 and QF 3 p3 using R1, are close to each other. If QF 3 p1 and QF 3 p3 areidentical, they are exchangeable for global query processingpurposes since they correspond to the same query fragment.When QF 3 p1 is selected by II in the global plan, QCC candistribute the load across S1 and R1 by clustering QF 3 p1and QF 3 p3 together and using them interchangeably. Notethat exchangeable query fragment processing plans need tobe identical, since two different query fragment processingplans may result in different global processing plans withdramatically different costs even they have an identical calibrated cost.The guidelines QCC follows for load distribution can besummarized as follows: For each query fragment processing plan selected byII, QCC locates alternative plans that can execute thesame query fragment with close calibrated costs (e.g.within 20%). QCC clusters these plans in a set and rotates in a roundrobin fashion for the future requests to the corresponding query fragment.Note that the workload of the fragment query (i.e. calibratedcost times frequency of queries issued in a given period)must be greater than a preset threshold value in order for thefragment query to be considered load distribution. The process is repeated periodically as calibrated costs may change.The load balancing at the query fragment processingplan level provides a simple implementation, but may misscertain global opportunities for load distribution. For example, this approach cannot provide load balance for federatedqueries with join operations across multiple remote sources.Proceedings of the 21st International Conference on Data Engineering (ICDE 2005)1084-4627/05 20.00 2005 IEEE

Figure 7. Deriving and Costing Alternative Federated Query PlansFurthermore, it can not provide load balancing for substitution of different query fragment processing plans of similarcost that result in similar global query processing costs.4.2 Load Balance at Global Query LevelAs illustrated in Figure 7, during the compile time, MWrecords possible execution plans and their costs for the twoquery fragments, QF 3 and QF 4, of Q6. The origin serversS1 and S2 can support two query fragment execution plansrespectively while the replica servers R1 and R2 can support only single query fragment execution plans. As a result, the II query optimizer has nine global query execution plans to choose from. QCC has information of allquery fragments and their estimated costs. However, sinceII query optimizer only stores the ”winner plan” in the explain table, QCC does not have knowledge about other alternative plans and their costs.To carry out load balance at the global query level, QCCneeds to derive all possible global execution plans. QCCutilizes the simulated federated system shown in Figure 2to generate all alternative global execution plans, Q6 p1 toQ6 p9, and estimate their calibrated costs. QCC achievesthis by iterating through possible query fragment pairs oneat a time at the wrapper level as shown on the right handside of Figure 7. Calibrated costs of query fragments areused to estimate the cost of the global query plan. The costof the alternative global query plans are then calibrated bythe information integration cost calibration factor.Once the calibration costs of all alternative query plansare derived, QCC can eliminate plans that are not promising. In our example, Q6 p1, Q6 p2, Q6 p3, Q6 p4, andQ6 p7 are eliminated since there are cheaper plans runningon exactly the same set of remote servers. For example,Figure 8. Selection of Federated Query Plansfor Load Balanceboth Q6 p1 and Q6 p5 are executed at S1 and S2. Wewould certainly select Q6 p5 over Q6 p1 since it is cheaper.Next, QCC identifies that Q6 p5, Q6 p6, and Q6 p8have similar costs (i.e. within 20%) and are executed ondifferent sets of servers. Therefore, QCC can select Q6 p5,Q6 p6, and Q6 p8 to form a group of plans to recommendsto II in a round robin fashion for Q6. By selecting the plansin this way, II distributes the load to multiple servers in abalanced way instead of to few servers.The guidelines QCC uses to select query plans for roundrobin load distribution are as follows:Proceedings of the 21st International Conference on Data Engineering (ICDE 2005)1084-4627/05 20.00 2005 IEEE For global query plans whose fragment queries areexecuted on the same set of servers, QCC picks thecheapest plan. QCC selects the cheapest plan and other alternativeplans whose costs are close to that of the cheapest plan

(e.g. within 20%) as a set of query plans to rotate inthe round robin fashion.The workload of a query (i.e. calibrated cost times frequency of queries issued in a period) must be greater thana preset threshold value in order for the query to be considered load distribution.QCC can use additional information to reduce the costof identifying an effective load distribution scheme. For instance, although in Figure 8 we show that there are ninealternative global query plans. The query fragments are labeled with the remote servers that return the fragments. Inthe actual implementation, QCC needs to execute Q6 in theexplain mode only four times:1. MW provides only the query fragment processingplans at S1 and S2 to II. The implementation is doneby adjusting cost functions of R1 and R2 to infinity sothat only the query fragment processing plans at S1and S2 will be considered. Q6 p5 will be selectedwhile Q6 p1, Q6 p2, and Q6 p4 will be eliminated.2. MW provides only the query fragment processingplans at S1 and R2. Q6 p8 will be selected whileQ6 p7 will be eliminated.from the sample database schema provided along with regular DB2 installments. Each table has been populated withrandomly generated data. Furthermore, the tables are replicated and distributed on the three remote servers such thateach server is involved in a diverse set of queries.In order to observe and interpret the performance of theproposed query calibration mechanism, we experimentedwith a diverse set of query types, each with different remotesource integration needs: (1) different query types need different numbers of tables, (2) the amount and type of II processing needed to merge data from multiple sources differfor different queries, and (3) the number of replicas available for the required tables varies from query to query. Thetables sizes also varied, with small tables having on to orderof 1000s of tuples and large tables having on the order of100000s of tuples.5.1 ProcedureGiven these scenarios, we carried out the experiments asfollows:3. MW provides only the query fragment processing planat R1 and S2. Q6 p5 will be selected while Q6 p3will be eliminated.4. MW provides only the query fragment processingplans at R1 and R2. Q6 p8 will be selected.Furthermore, QCC does not need to include query fragments from all remote sources in deriving alternative globalplans. Since QCC maintains the server cost calibration factors for all remote sources, it can exclude those remotesources with very high server cost calibration factors frombeing considered as candidates for query routing destinations.5 ExperimentsIn this section, we experimentally evaluate the performance and reliability gains obtained using QCC in information integration. More specifically, we aim to observewhether (1) as the server loads change, the QCC can learnthe query-cost calibration factors, successfully, (2) whetherthe cost factors learned by the QCC help II to pick betterplans than those it would pick without transparent cost calibration by QCC, and whether (3) QCC can help the systemachieve load balance by inducing the II and the MW to usedifferent plans during each iteration.In order to observe these three aspects of QCC deployment, we created an information integration scenario withone II server and three remote servers, each hosting a IBMDB2 DBMS. We populated the remote servers with tablesProceedings of the 21st International Conference on Data Engineering (ICDE 2005)1084-4627/05 20.00 2005 IEEE Step 1. Query fragment generation: Queries in theworkload are processed by II and the relevant queryfragments are generated. Step 2. Query-fragment cost estimation: Query fragments obtained in the previous step are forwarded tothe available servers in the explain mode and the corresponding costs are observed. Step 3. Baseline query-fragment cost observation:Query fragments obtained in the first step are forwarded to the available servers and the correspondingserver response times are observed. Step 4. Heavy-server-load query-fragment cost observation: Servers are hit with a heavy update load, andthe query fragments obtained in the first step are reforwarded to the available servers and the corresponding he

DB2 Information Integrator (II) deploys cost-based query optimization to select a low cost global query plan to execute. Thus, cost functions used by II heavily inu-ence what remote servers to be accessed and how federated queries are processed. Cost estimation is usually based on database statistics, query statements, and the local and re-