
Transcription
Clio: Schema Mapping Creation and DataExchangeRonald Fagin1 , Laura M. Haas1 , Mauricio Hernández1 ,Renée J. Miller2 , Lucian Popa1 , and Yannis Velegrakis31IBM Almaden Research Center, San Jose, CA 95120, USA2University of Toronto, Toronto ON M5S2E4, Canada3University of Trento, 38100 Trento, [email protected], [email protected], [email protected],[email protected], [email protected], [email protected]. The Clio project provides tools that vastly simplify information integration. Information integration requires data conversions tobring data in different representations into a common form. Key contributions of Clio are the definition of non-procedural schema mappingsto describe the relationship between data in heterogeneous schemas, anew paradigm in which we view the mapping creation process as one ofquery discovery, and algorithms for automatically generating queries fordata transformation from the mappings. Clio provides algorithms to address the needs of two major information integration problems, namely,data integration and data exchange. In this chapter, we present our algorithms for both schema mapping creation via query discovery, and forquery generation for data exchange. These algorithms can be used inpure relational, pure XML, nested relational, or mixed relational andnested contexts.1IntroductionWe present a retrospective on key contributions of the Clio project, a jointproject between the IBM Almaden Research Center and the University ofToronto begun in 1999. Clio’s goal is to radically simplify information integration, by providing tools that help in automating and managing one challenging piece of that problem: the conversion of data between representations.Clio pioneered the use of schema mappings, specifications that describe the relationship between data in two heterogeneous schemas. From this high-level,non-procedural representation, it can automatically generate either a view, toreformulate queries against one schema into queries on another for data integration, or code, to transform data from one representation to the other for dataexchange. In this chapter, we focus on two key components of Clio: the creation ofmappings between heterogeneous schemas, and their use for the implementationof data exchange.A.T. Borgida et al. (Eds.): Mylopoulos Festschrift, LNCS 5600, pp. 198–236, 2009.c Springer-Verlag Berlin Heidelberg 2009
Clio: Schema Mapping Creation and Data Exchange1.1199Schema MappingSchema mappings are fundamental for a number of important information integration problems [9] including data integration, data exchange, peer-to-peerdata sharing, schema integration and schema evolution. Applications are typically limited to handling information with a specific schema, so they rely onsystems that can create and use mappings to transform data from one representation to another.A fundamental requirement for Clio is that it make no assumption about therelationship between the schemas or how they were created. In particular, we donot assume that either of the schemas is a global or mediator schema, nor thatone schema is a view (global or otherwise) over the other. This implies that bothschemas may contain data not represented in the other, and that both may havetheir own constraints.This requirement to map independently created schemas has a strong impacton our mapping language, as we need one that is more general than those used intraditional schema integration [6] or in mediator systems such as TSIMMIS [16]or Information Manifold [34].A second requirement is that we be able to map between relational schemasand nested schemas (for example, XML schemas). As XML emerged as a commonstandard for exchanging data, an early motivating application for our work waspublishing legacy relational data in XML. This often requires relational data tobe placed into a predefined XML schema (defined, e.g., by a standards committeeto permit meaningful exchange within a specific domain). However, for otherapplications including schema evolution, data warehousing, and data federation,we also need to be able to map data between different relational schemas andbetween any combination of nested and relational schemas.A third requirement is that we be able to create and use mappings at differentlevels of granularity. For some applications and some schemas, it may be sufficientto create fine-grained mappings between individual components (for example,between attributes or elements to translate gross salary in francs to net salaryin dollars). For others, mappings between broader concepts are required (forexample, between the order concept in one billing application with that usedby another). And for other applications, we may need to map full documents(for example, map every company’s carbon emission data expressed in a schemasuitable for the European Union Emission Trading Scheme to a schema designedfor the Portable Emissions Measurement Systems standard).Finally, we want our mapping creation algorithms to be incremental. Thereare many motivations for this. First, for many tasks, complete mapping of oneentire schema to another is not the goal. It may suffice to map a single concept toachieve the desired interoperability. Second, we want a tool that gives users (orsystems) with only partial knowledge of the schemas, or limited resources, useful(and usable) mappings despite their incomplete knowledge or resources. We hopethat incomplete mappings can help them in understanding the schemas anddata better, and that the mappings can be refined over time as need arises, forexample, as new data appears, or the application needs change. This particular
200R. Fagin et al.aspect of our approach was explored in more detail in our work on data-drivenmapping refinement [51] and in work on mapping debugging [3], but will not beemphasized in this chapter. This ability to evolve mappings incrementally hasmore recently been coined pay-as-you-go [23].Clio mappings assume that we are given two schemas and that we would liketo map data from the first to the second. We refer to the first schema as a sourceschema, and the second as a target schema. In practice, this meets most application needs, as most require only a uni-directional flow of data. For example,one common use of mappings is in query reformulation, commonly referred toas data integration [33], where queries on a target schema are reformulated, using the mappings, into queries on a source schema. For applications requiringbi-directional mapping, mappings are created in both directions.1.2Implementing Data ExchangeAnother common use of mappings is for data exchange where the goal is to createa target instance that reflects the source instance as accurately as possible [19].Since the target is materialized, queries on the target schema can be answereddirectly without query reformulation. At the time we started Clio, data integration had been widely studied, but work on data exchange was quite dated.Foundational systems like Express [46,47] did data exchange for mappings whichwere much less expressive than those needed to map arbitrary schemas. Therewere no systems that performed data exchange for the general mappings westrove to create.For independent schemas, because the schemas may represent overlappingbut distinct sets of concepts, a schema mapping may relate a source instancewith many possible target instances. As a result, we have a fundamentally newproblem: given a schema mapping, determine which possible target instance isthe best one to use for data exchange. At the time of our first results [38,43],this problem had not yet been formalized. Hence, in Clio, we made some intuitive decisions that were later formalized into a theory for data exchange [19].In this chapter, we discuss this intuition, and how it corresponds to the latertheory. We also discuss some important systems issues not covered by the theory. Specifically, we consider how to create a data exchange program from aschema mapping. Due to the requirements for schema mapping laid out above,we choose to produce executable queries for data exchange. A schema mappingis a declarative specification of how an instance of a source schema correspondsto possibly (infinitely) many target instances, and from this we choose a besttarget instance to materialize. Hence, our data exchange queries, when executedon a source instance, will generate this one chosen target instance.The origins of this chapter first appeared in Miller et al. [38] and Popa etal. [43]. This chapter also includes details, originally from Velegrakis [48], ofour data exchange implementation. The requirements outlined above force us toaccommodate various runtime environments. In this chapter, we discuss how togenerate data exchange queries in SQL, XQuery or XSLT.
Clio: Schema Mapping Creation and Data Exchange201The chapter is organized as follows. Section 2 introduces a motivating exampleand describes the problem of schema mapping generation and the problem ofdata exchange. Section 3 presents the data model we will use for representingboth relational and nested relational schemas along with our schema mappingformalism. Section 4 presents our algorithm for generating schema mappings.Section 5 presents our algorithm for data exchange (mapping code generation).We present a discussion and analysis of our algorithms in Section 6, describe therelated work in Section 7 and then conclude.2A Motivating ExampleTo motivate our approach, we first walk through an example explaining theintuition behind our mapping creation algorithm, and highlighting some of itsfeatures. We then extend our example to illustrate how schema mappings canbe used to generate queries for data exchange, and describe the key innovationsin that algorithm.2.1Schema Mapping CreationSchema. Consider a data source with information about companies and grants.The structure of its data is described by the Schema S, illustrated in Figure 1. It isa relational schema containing three tables, companies, grants, and contacts,presented in a nested relational representation that we use to model both relational and XML schemas. It contains a set of grants (grants), each consisting ofa grant identifier (gid), a recipient (recipient), its amount (amount), its supervisor (supervisor) and its manager (manager). The recipient is actually thename of the company that received the grant. For each company, the databasestores its name (name), address (address) and the year it was founded (year).Similarly, the supervisor and manager are references to some contact information, which consists of an identifier (cid), an email (email) and a phone number(phone). The curved lines f1 , f2 and f3 in the figure represent referential constraints specified as part of the schema. For example, f1 may be a foreign key, orsimply an inclusion dependency, stating that values in grants.recipient mustalso appear in companies.name.Consider a second schema T , as illustrated on the right-hand side of Figure 1. It records the funding (fundings) that an organization (organizations)receives, nested within the organization record. The amount of each funding(budget) is kept in the finances record along with a contact phone number(phone). The target may be an XML schema containing a referential constraintin the form of a keyref definition (f4 ).Correspondences. To begin to understand the relationship between theschemas, we may invoke a schema matcher to generate a set of element correspondences (or matchings). Alternatively, we could ask a user (for example, adata designer or administrator familiar with the schemas) to draw lines between
202R. Fagin et al.Schema S:Schema T:organizations: Set of Rcdcompanies: Set of Rcdnameaddressyearv1fundings: Set of Rcdgrants: Set of v2fidfinIdfinances: Set of Rcdv3f3finIdbudgetphonef4contacts: Set of Rcdcidemailphonev4Fig. 1. A source and a target schema in a mapping scenarioelements that should contain related data. In our example, the dashed arrowsbetween the elements of the two schemas in Figure 1 represent a set of matchings or correspondences. The line v1 indicates (informally) that what is calleda company name in the first schema, is referred to as an organization code inthe second. In contrast, both schemas have an element year, but the data administrator (or matching tool) has specified no arrow between them. That maybe because there is reason to believe that these elements do not represent thesame concept. For instance, element year in the first schema may represent thetime the company was founded, while in the second it may represent the timethe company had its initial public offer (IPO).Our approach is agnostic to how correspondences are created, whether manually or (semi-)automatically, but is cognizant that matchings are often incomplete, and sometimes incorrect. Hence, we have developed techniques for incrementally modifying mappings as correspondences change [49]. To keep ourexample simple, we assume that correspondences v1 , v2 , v3 , v4 are correct.Mappings. One way a correspondence can be interpreted is that the targetschema element should be populated with values from the source schema element.This can be formally expressed using an inter-schema inclusion dependency ormore generally through a source-to-target tuple generating dependency, (tgd) [7].A tgd representing correspondence v1 of Figure 1 is shown below (the nested setfundings which is represented by the variable F inside organizations will beexplained below). n, d, y, companies(n, d, y) y , F organizations(n, y , F )(1)This simple mapping states that for each companies tuple, there must bean organizations tuple whose code is the same as the companies.name; this isrepresented by the shared variable n, which carries source data to the target. As aconvention, when writing tgds, we underline all the variables that appear in boththe left-hand side and the right-hand side of the implication. This target tuple
Clio: Schema Mapping Creation and Data Exchange203must have a value for the year attribute, but the mapping does not specify whatthis value should be (this is represented by the existential variable y ). Similarly,we could write simple tgds to express the other three correspondences.If our application only requires information about organizations (and notabout fundings or finances), then we can stop the mapping generation here.This simple element-to-element mapping can be used for data exchange or fordata integration [52]. However, if the user is interested in mapping more data,we can continue the mapping generation using the other correspondences fromFigure 1 to map additional concepts.Associations. Correspondences alone do not specify how individual data valuesshould be connected in the target. For example, in the target, funding information is nested inside organizations. This nesting indicates that there is a semantic association between organization and funding records. This may representorganizations and the funding that the organization has received, or possibly organizations and the funding that they have given to others. The semantics is notspecified precisely in the schema, but it is clear that some real-world associationis being represented. Given this, we will look for associations between organization information and funding information in the source to see if one of these canbe used to associate data in the target. For our example, organizations.codecorresponds to companies.name, while fundings.fid corresponds to grants.gid.Hence, our algorithm will search for associations between these source elementswithin the source schema. In our example, the referential constraint f1 indicatesthat each grant is associated with a company, thus this constraint can be usedto associate each company with a set of grants. In general, there may be manyways to associate elements within a schema. Our algorithm will use logical inference to find all associations represented by referential constraints and a schema’srelational and nesting structure.For our example, we have only one way of associating company names andgrant gids in the source, so we will use this association to associate fundings withorganizations in the target. A mapping reflecting this association is representedby the following formula. n, d, y, g, a, s, m companies(n, d, y), grants(g, n, a, s, m) y , F, f organizations(n, y , F), F(g, f )(2)The variable F in formula (2) does not represent an atomic value, but rather aset identifier, and is also used as a term in the formula. This variable representsthe set of fundings that an organizations tuple has.Notice that Mapping (2) specifies what must be true of the target data, giventhat a certain pattern holds in the source data. In this example, it says that ifa grant joins with a company in the source, then there must be an organizationin the target with the name of the company as its code, and with a fundingsrecord nested inside of the organization record that has the grant’s gid as itsfundings.fid. No other constraints are placed on what is in this set. So the
204R. Fagin et al.mapping is specifying that the association between grants and companies shouldbe preserved in the target.Now let us consider the correspondence v3 . Considered in isolation, this correspondence could be represented by the following mapping. g, r, a, s, m grants(g, r, a, s, m) f, p finances(f, a, p)(3)However, this mapping does not recognize that grant amounts are associated with specific grant gids (in the source) and that fundings.fid andfinances.budget are associated in the target (through the referential constraintf4 ). If these two associations represent the same semantic association, then abetter mapping can be constructed by using the source and target associations. n, d, y, g, a, s, m companies(n, d, y), grants(g, n, a, s, m) y , F, f, p organizations(n, y , F), F(g, f ), finances(f, a, p),(4)Notice that a company and grant tuple that join in the source will create threetuples in the target: an organizations tuple, a fundings tuple (which is nestedinside the organizations tuple), and a finances tuple. The mapping specifiesthat the fundings and finances tuples must share the same value (f ) in theirfinId attributes. It does not, however, specify what this value must be (that is,the variable f is existentially quantified and is not bound to source data).Now to complete our example, let us consider the final correspondence v4 . Inthe target, a phone is associated with a budget because they are representedin the same relation finances. In the source, there are two ways to associatea grants.amount (the source for finances.budget) and a contacts.phone (thesource for finances.phone). These are represented by the two referential constraints f2 (which associates a grant with its supervisor’s phone) and f3 (whichassociates a grant with its manager’s phone).It is not clear which, if either, of these associations should be used to createfinances tuples. Clio will create two mappings, one using f2 (Mapping (5))which uses a join on supervisor, and one using f3 (Mapping (6)) which uses ajoin on manager. To help a user decide which mapping to use, Clio provides adata viewer which allows users to see (and compare) sample target data createdby each mapping [51]. n, d, y, g, a, s, m, e, p companies(n, d, y), grants(g, n, a, s, m), contacts(s, e, p) y , F, f organizations(n, y , F), F(g, f ), finances(f, a, p)(5) n, d, y, g, a, s, m, e, p companies(n, d, y), grants(g, n, a, s, m), contacts(m, e, p) y , F, f organizations(n, y , F), F(g, f ), finances(f, a, p)(6)We have illustrated a few of the issues involved with generating schema mappings. We now highlight some of the features of the Clio mapping algorithm.
Clio: Schema Mapping Creation and Data Exchange205Mapping Formalism. As our example illustrated, we use source-to-target tgds,a generalization of the relational tgds of Fagin et al. [19] to the nested relationalmodel, to represent schema mappings. Our mappings are a form of what hasbeen called sound GLAV (global-and-local-as-view) mappings [33]. In general,a GLAV mapping asserts a relationship between a query over the source and aquery over the target. We use sound mappings, where the relationship betweenqueries is a containment relationship (the result of the source query is containedin the target query) as is common in data integration. Such mappings do notrestrict what data can be in the target; hence, we have the freedom to mapmultiple sources into a single target.Mapping Generation. Clio exploits the schema and its constraints to generate a set of alternative mappings. Our approach uses the chase technique [36]to generate all possible associations between source elements (and all possibleassociations between target elements). The mappings are formed using theseassociations or other associations given by a user.Multiple Mappings. As we create mappings, each query may omit informationthat the user may have wanted included. Consider Mapping (2). This mappingtakes grants that have an associated company and creates target data from thisinformation. Notice, however, that companies that have no grants would notbe included in the mapping. We may want to include all companies, not justcompanies with grants, in the target. To allow for this, Clio generates a setof mappings that would map all source data (in this example, both companieswith and without grants). A user can then choose among these mappings. If sheonly wishes to map a subset of the source data, she can do so by selecting anappropriate subset of the mappings Clio generates.2.2Query Generation for Data ExchangeFor data exchange, Clio can generate code that, given a source instance, willproduce an instance of the target schema that satisfies the mapping and thatrepresents the source data as accurately as possible. In general, given a sourceinstance there may be many target instances that satisfy the mapping (or manysolutions in data exchange parlance [19]). Hence, to perform data exchange, wemust choose a single “best” target instance, i.e., a single solution to materialize.Let us assume, for example, that the instance of the source schema of Figure 1is the one illustrated in Figure 2, and that a user has indicated that Mapping (5)is correct and would like to create a target instance. For each companies tuplethat joins with grants and contacts, the mapping indicates that there shouldbe an organizations tuple containing as organization code the company name.Furthermore, there must also be a nested fundings element containing the gidof the grant from the source. Finally, a finances element must also exist withthe same value for its finId as the value of finId in this fundings record.Moreover, the finances element must contain the grant amount in the budgetelement and the supervisor phone number as phone. An instance that satisfies
206R. Fagin et al.Companiesname addressMSRedmond, SAAT&T Dallas, TXIBM Armonk, NYGrantsgid recipientg1 MSg2 MSg4 [email protected] [email protected] [email protected]@attDorman 73706625236072703600102Fig. 2. An instance for the source schema in Figure 1the mapping, i.e., a solution, can be seen in Figure 3. In this solution, fundingstuple g1 can correctly be joined with the first finances tuple to associate it withits correct budget (1M). However, in Figure 3 all fundings tuples are associatedwith all finances tuples, which was not true in the source. In data exchange,we would like the target instance to represent only the data associations in thesource. Clearly the instance of Figure 3 does not fullfill that desire. Furthermore,the last tuple in the Finances table does not correspond to any source data, yetits inclusion in the instance does not violate Mapping (5). So while this instanceis a solution, it is not minimal.Fagin et al. [19] proposed the use of universal solutions for data exchange. A universal solution is an instance of the target schema that contains no more and noless than what the mapping specification requires. A universal solution for Mapping (5) is given in Figure 4. Note that the values of the finId attribute havebeen created to associate each fundings with all and only the finances tuplesthat contain the correct budget amount. The instance is also minimal in that itdoes not contain any tuples that were not required to be in the target. To computea universal solution, Fagin et al. [19] present an algorithm based on the chase [36].However, in Clio, we use queries to perform the data exchange. Here we presentalgorithms that, given a schema mapping, generate a set of executable queries toperform the desired data exchange. The queries can be in SQL, XQuery or XSLTdepending on the desired runtime environment. In the case of a pure relationalsetting (relational source and target schemas), these queries generate a universalsolution. In our algorithm, we make use of Skolem functions (one-to-one functions) that generate values based on a set of source values. We will discuss laterhow to determine the correct arguments for these Skolem functions and when oneSkolem function should be reused in different schema elements. For example, inSection 5, we show why the Skolem function for finId needs to depend on foursource attributes (name, gid, amount, and phone).In the case of a nested target schema, Clio applies additional groupingand nesting to produce a target instance that is in PNF (Partitioned Normal
Clio: Schema Mapping Creation and Data Exchange207Organizationscode yearMSFundingsfid finIdg1 10g2 10code yearAT&TFundingsfid finIdg4 10FinancesfinId 09479Fig. 3. A non-universal solution target instance for Mapping (5)Form) [1]. This is done to minimize the redundancy in the target instance.Consider the universal solution given in Figure 4 which contains a differentorganizations tuple with a singleton fundings set for each companies andgrants pair in the source, even if multiple pairs share the same company name.Clio avoids this redundancy, by producing a single organizations tuple for eachsource name and grouping all the fundings that belong to the same organization together under one single organization element (Figure 5). As shown in thefigure, we use Skolem functions to represent set identifiers for each fundingsset. Our algorithms determine the correct arguments for these Skolem functionsto achieve PNF grouping. In more recent work which will not be covered in thischapter [24], we have considered how to declaratively represent PNF groupingsemantics in the mapping specification along with other types of grouping. Inthis chapter, we will assume PNF is the desired semantics, and we present oursolutions for generating PNF target instances.There are two main innovations in our data exchange algorithm. The first isa new technique for generating Skolem terms to represent existential values andfor achieving grouping in the target instance. Second, our algorithm can identifyand merge data that are generated by different mappings, but represent thesame target entities. Assume, for instance, that the user wanted to populate thetarget with all the companies, independently of whether they have funding or not.Mapping (5) can generate only companies with grants (i.e., funding), due to thejoin it performs on the source data. Mapping (1) on the other hand, generates allthe companies, but without their potential funding. The desired target instancecan be achieved by using both mappings (5) and (1). The resulting instancewould be the same as Figure 5 but with an additional organizations elementfor IBM having an empty fundings subelement. The MS and AT&T tupleswould not be impacted, even though they are produced by both mappings.
208R. Fagin et al.Organizationscode yearMSFundingsfid finIdg1 Sk2 (MS,g1,1M,7062838)code yearMSFundingsfid finIdg2 Sk2 (MS,g2,2M,7066252)code yearAT&TFundingsfid finIdg4 Sk2 (AT&T,g4,3M,3607270)FinancesfinIdSk2 (MS,g1,1M,7062838)Sk2 (MS,g2,2M,7066252)Sk2 33607270Fig. 4. A universal solution target instance for Mapping (5)3Mapping Language and Schema ConstraintsSchemas and Types. We use a nested relational model to model both relationaland XML Schemas. In general, a schema is a sequence of labels (called roots),each with an associated type τ , defined by the grammar:τ :: String Integer Set of τ Rcd[ l1 : τ1 ,. . ., ln : τn ] Choice[ l1 : τ1 ,. . ., ln : τn ]Types Integer and String are called atomic types, Set of is a set type, and Rcdand Choice are complex types. With respect to XML Schema, we use Set of tomodel repeatable elements (or repeatable groups of elements), while Rcd andChoice are used to represent the “all” and “choice” model-groups. For each settype Set of τ , τ must be an atomic (String or Integer) type or a Rcd type. We donot consider order, that is, Set of represents unordered sets. “Sequence” modelgroups of XML Schema are also represented as (unordered) Rcd types.Instances. An instance of a schema associates each schema root l of type τ witha value v of type τ . For the atomic types, the allowed values are the expected ones(i.e., strings, integers). A value of type Rcd[l1 : τ1 , . . . , ln : τn ] is an unorderedtuple of pairs [l1 : v1 , . . . , ln : vn ] where vi is a value of type τi with 1 i n.A value of type Choice[l1 : τ1 , . . . , ln : τn ] on the other hand, is a pair li : vi where vi is a value of type τi with 1 i n. With respect to XML, the labelsl1 , . . . , ln model element names or attribute names, while the values v1 , . . . , vnrepresent the associated contents or value. In our model, we do not distinguishbetween XML elements and attributes.
Clio: Schema Mapping Creat
gorithms for both schema mapping creation via query discovery, and for query generation for data exchange. These algorithms can be used in pure relational, pure XML, nested relational, or mixed relational and nested contexts. 1 Introduction We present a retros