Flexible Semantic B2B Integration Using XML Specifications

Ken Barker
Computer Science, University of Calgary
Calgary, Alberta, [email protected]

Ramon Lawrence
Computer Science, University of Iowa
Iowa City, Iowa, [email protected]

Abstract

XML has recently evolved as the key enabling technology to support B2B integration in the modern computing environment. Although much hype exists around the issue of data exchange between multiple businesses on the web, very few concrete proposals have evolved that are sufficiently flexible to exploit true data exchange. Two approaches have been proposed to facilitate data exchange from business to business. The first technique uses a standard exchange format to which all businesses wishing to participate in data exchange must conform. Alternatively, the businesses need to utilize a middleware solution that is sufficiently flexible to capture any type of data exchange but powerful enough to represent any data that needs to be exchanged. The system described in this paper captures the key underlying semantics of the participating business data without forcing conformance to an inflexible standard. Further, since our system is based on XML, it exchanges data in a format that can be utilized by any application capable of reading XML data, so both the schema and content are available for subsequent analysis.

Introduction

Much emphasis has recently been placed on developing suitable paradigms for conducting business-to-business (B2B) interactions. The needs of this application are inherently different from those of other integration models because B2B interactions are formed on an as-needed basis and are often ad hoc. Thus, traditional architectures such as multidatabase systems [1] are inappropriate because they introduce too much overhead, since their primary goal is to provide integration across many database systems. The past work most similar to B2B applications is that undertaken on federated database systems [6], in that “one-off” integrations between two systems are defined. Each participant in a federated system must supply an export schema, which describes the data being made available, and an import schema, which describes the data being extracted from a remote system. Although the architecture is similar to that being developed for B2B applications, it suffers from the explosion of mappings that occurs each time a new system is incorporated into the federation.

Many industrial and research B2B applications propose the use of “standards” as the mechanism to facilitate communication. Several standards have been developed to support B2B data exchange, including BizTalk [2] and SIL [3]. Conformance to any standard is voluntary, but in many environments compliance is necessary to facilitate progress or even functionality. For example, networking and inter-networking are only feasible if those participating are willing to follow the generally accepted standards. Support for network standards across organizations is also provided by common goals. Even if two organizations compete at all other levels, it should be immediately evident that they can both exist on the same “Internet” only if they are willing to support a common network protocol.

Thus, the motivation for this work is based on the premise that although standards have a role to play in B2B interactions, they will never be a complete solution to the problem. This argument is based on two key facts. First, guaranteeing conformance to a standard will only work if one is defined for your particular application environment. Second, it would be impossible to anticipate a priori all of the ways any two businesses might want to exchange data, so a more flexible, possibly proprietary, protocol needs to be developed to enable data exchange.

Architectural Framework

Before discussing the details of our methodology, a B2B architecture is presented to frame the discussion.
Figure 1 depicts two businesses that wish to exchange data across a network. A typical B2B scenario might be a supplier that provides some components required by a manufacturer to produce some product. The kinds of information that might need to be exchanged include order information (e.g., POs, sales contacts, SKU numbers, shipping dates), invoice information (e.g., invoice numbers, contact information, due dates), and distribution channel information (e.g., shipper information (FedEx), delivery routing, tracking information). The likelihood that both the shipper and receiver utilize homogeneous systems is extremely small, so middleware is required to facilitate data exchange.

Figure 1: B2B System Architecture

To illustrate the B2B requirements, we will track an invoice sent from the supplier to the manufacturer depicted in Figure 1. We will make the unlikely assumption that the manufacturer wants to receive the invoice as quickly as possible and that it will be processed as soon as it arrives so the supplier can be paid immediately. To make the problem more realistic we will assume that the manufacturer's database is driven by Oracle while the supplier's is IBM's DB2. Both businesses will have their own schema and data formats, and each will access its databases in quite different ways. It should be fairly obvious that each business will have an application that is capable of producing an invoice, processing it, and ultimately ensuring that the transaction is completed successfully.

Thus, the supplier will generate an electronic invoice by extracting the required data from the data stored in its DB2 database. This will involve generating an internal invoice number, a description of the service supplied for the invoice, and an “e-packet” in the format used internally by the supplier's systems. This e-packet cannot be transmitted directly to the manufacturer because the format used by the manufacturer will likely be quite different. Thus, the manufacturer and supplier must agree on a common representation. If we assume that the two businesses agree on BizTalk as their exchange language, the supplier's e-packet must be converted to this format by the adapter before subsequent processing of the transaction can take place. In Figure 1 the processing of the invoice is accomplished using the module labelled Inter-Business Workflow Management, where the “business logic” is performed. Executing this workflow will undoubtedly require data and/or permissions from the manufacturer, but this can only be achieved if the request is translated from the BizTalk format to the native format processable at the manufacturer's site. This requires another adapter capable of translating the request into an e-packet capable of executing the necessary transaction on the manufacturer's machine. Once this has been processed, the entire process must be reversed to return the answer, hopefully an acknowledgement of the receipt and payment of the invoice. This requires that another e-packet (native to Oracle) be produced, which is then translated to BizTalk for further workflow processing. Finally, the result of this intermediate processing is translated back to a format understood by the supplier's DB2 database before a final acknowledgement of the transaction can be made.

Contributions

Based on this scenario we can now describe where this paper contributes. We are primarily interested in the middle component of Figure 1. Further, we are primarily interested in the technology necessary to facilitate data transmission through this middle component.
The details of the workflow are clearly key to any B2B application, but our focus is not primarily on how to write this business logic. Rather, we are interested in the suitability of various techniques to exchange data from one business (the supplier) through the middleware component to another business (the manufacturer). We argue that standardization efforts aimed at developing a universally accepted lingua franca for such high-level B2B applications will never be fully successful. These systems are too rigid to adapt to unanticipated application needs and will ultimately be ignored by businesses because of the costs of changing previously built applications that no longer conform to the new standard. As evidence for this claim, consider the enormous time lag between the proposed change from IPv4 to IPv6 in [...] acknowledgement of the need for the update. Instead we argue that successful data exchange is only achievable if a system is provided that is readily extendable to new application needs while providing a framework that ensures participants easily conform to the lingua franca. Thus, our work focuses on developing an extensible exchange “language” that captures the semantics of the businesses that are willing to conform to the lingo.

To this end we propose a system capable of capturing the specific data needs of individual businesses. Although we do not believe it will ultimately be necessary, it is possible for our system to define specific exchange schemas between any two businesses in much the same way as was initially proposed for federated systems. Thus, individual businesses can use our tool to write specific “wrappers” for each of their systems. However, based on substantial experience, we know that there is an enormous amount of overlap between businesses. This is particularly true when you consider who is likely to participate in a B2B exchange. To illustrate this point consider the following simple example.
Select any major book retailer (Barnes and Noble or Chapters) and consider for a moment with whom they are likely to undertake a B2B transaction. Clearly the answer is a related business such as a book supplier (Morgan Kaufmann or Wiley Press). It is extremely unlikely that a pharmaceutical business will undertake a B2B transaction with the book retailer to sell drugs. Thus, businesses in the book industry are likely to have a common lingo that is used by all participants, and the slight differences that will undoubtedly exist can be readily addressed using the system described shortly.

We can now consider the key elements that must exist to facilitate this exchange. First, the core of any conversation is the need to have both participants speak a common language. Booksellers are able to undertake a dialog because they have common terms for common concepts. Thus, the first element is a “standard dictionary” that defines which term is used to represent which concept. Unfortunately, data exchange between businesses is never based on using precisely the same terms to represent the same semantics in both organizations. Thus, we must be able to map from the syntax used to represent a concept at an organization to its representation in the standard dictionary. Fortunately, we only need to do this once for each business, and it is interesting to note that the mapping from the standard dictionary term to the syntax used at the business is simply the inverse of the first mapping. Once the dictionary is defined and the mappings are in place, executing transactions between the two businesses requires that the business workflow for all such transactions be written.
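
The two-way mapping described above can be illustrated with a minimal sketch. The local field names and dictionary terms below are hypothetical, chosen only for illustration, and the paper does not prescribe this representation. The sketch assumes each local name corresponds to exactly one dictionary term, which is what makes the inverse well defined.

```python
# Hypothetical local field names for one bookseller (left) and the standard
# dictionary terms they represent (right). Both columns are illustrative.
LOCAL_TO_DICTIONARY = {
    "Qty": "Quantity",
    "ISBN": "ISBN",
    "Title": "Title",
}


def invert(mapping: dict[str, str]) -> dict[str, str]:
    """Return the reverse mapping (dictionary term -> local name).

    Raises ValueError when two local names share one dictionary term,
    because the inverse would then be ambiguous.
    """
    inverse: dict[str, str] = {}
    for local_name, term in mapping.items():
        if term in inverse:
            raise ValueError(
                f"{term!r} maps back to both {inverse[term]!r} and {local_name!r}"
            )
        inverse[term] = local_name
    return inverse


# Defined once per business; the reverse direction is derived, not hand-written.
DICTIONARY_TO_LOCAL = invert(LOCAL_TO_DICTIONARY)
```

Deriving the reverse direction rather than maintaining it by hand keeps the two mappings consistent whenever the business revises its schema.
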
The architecture presented here is inherently different from past proposals because the business logic is not written in the language of the legacy systems at each business but rather in the standard dictionary's terms and concepts, so application developers can manipulate an integrated view of the data for the first time.

The balance of the paper describes the process for creating the standard dictionary by presenting an architecture that describes the capture process. Selection of the underlying representation language is always a critical decision when developing any system suitable for integrating legacy systems. The problem is further complicated because no matter what selection is made, it too will ultimately be another legacy system. Thus, we want to select an environment that will likely have the longest possible life looking into the immediate future but, more importantly, is sufficiently extensible to allow it to adapt to changing needs into the more distant future. Thus, we have selected XML as the implementation language because of its inherently eXtensible nature. These issues, in addition to our capture process to create the standard dictionary, are detailed in the section describing Unity's architecture. The integration of a business' data source into the system is accomplished by defining mappings using common terms that are used by the middleware, and is detailed afterward. The final technical element described in this paper details how queries are accomplished using the middleware described. Clearly, the workflow management component depicted in Figure 1 is only feasible if queries can be posed by one system, translated to the common language, and posed at the other business. Once queries can be asked and answered, it should be evident that the system is capable of providing the middleware necessary for B2B processes. Next, a query process is described to detail the query processing features of our middleware, together with an example of its utility. The penultimate section provides a very brief review of other research activities, leaving the final section to summarize our insights and provide pointers for subsequent research.

Unity Architecture

As mentioned above, the adapters depicted in Figure 1 are the focus of the work reported here. Although it would be tempting to consider these little more than wrappers for the participating business' database, the system is actually much more powerful. Unlike wrappers, these adapters must also provide facilities to define the underlying ontology often provided by standards conformance. This requires a representation of the ontology, the software to produce an arbitrary ontology, translation mechanisms to a flexible semantic notation, and a query processing capability so results can be exchanged from business to business. These adapters are the essence of our middleware solution, which is called Unity¹ to reflect a goal of providing a unified data exchange mechanism. This section explores some of the details associated with Unity.

The Unity architecture consists of five main components: a standard term dictionary, a dialect of XML used to specify metadata (X-Specs) that captures data semantics, an integration algorithm for combining X-Specs into an integrated view, a query processor for resolving conflicts at query time, and “wrapper” software at each database site responsible for accessing the participating databases available at businesses.
The dictionary provides terms for describing schema elements and avoiding naming conflicts, thereby forming an unambiguous lingua franca. The integration algorithm matches concepts from X-Specs to produce an integrated view, and the query processor translates a semantic query on the integrated view into the dialect expected by the business receiving the request. The wrapper software verifies user access to the system, processes SQL requests, and returns results.

The architecture utilizes three component processes:

- Capture Process: A capture process is independently performed at each data source to extract database metadata into an XML document called an X-Spec.
- Integration Process: The integration process retrieves X-Specs from each database and combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view (see Figure 2-a).
- Query Process: The user formulates queries on the integrated view that are mapped by the query processor to SQL. The SQL is transmitted to each database wrapper. The results returned are integrated and formatted (see Figure 2-b).

¹ Unity is a proprietary system developed at the University of Manitoba.

To illustrate the architecture, we use the following example involving two book databases. The first company, called Books-for-Less, has a database as given in Figure 3. The second company, called Cheap Books, stores its database as described in Figure 4.
Note that database field and database names appear in italics and semantic names in the integrated view are in Arial Narrow.

Figure 2: (a) The Integration Process of Unity; (b) The Query Process in Unity

Tables      Fields
Book        ISBN, Title, Author, Publisher, Price, Qty

Figure 3: Books-for-Less Database Schema

Tables      Fields
Book        ISBN, Author id, Publisher id, Title, Quantity, Price, Description
Author      Id, Name
Publisher   Id, Name

Figure 4: Cheap Books Database Schema

The Capture Process

The capture process is an off-line procedure in which the semantics of a relational database schema are captured into an XML document called an X-Spec. The X-Spec is designed to store sufficient schema metadata that it can be compared and integrated across systems. Using a standard term dictionary allows related concepts to be uniquely identified by name.

Standard Dictionary

The foundation of the architecture is the acceptance of a standard term dictionary, which provides terms to represent concept semantics that are agreed upon across systems. Thus, the architecture operates under the assumption that naming conflicts are prevented by utilizing standard terms to exchange semantics. Without a standard set of terms or names to communicate knowledge, knowledge cannot be integrated or exchanged because its semantics are not known. Thus, by accepting a standard dictionary, schema, or set of XML tags, a system assumes away the naming problem by accepting a lexical semantic framework for the expression of data semantics, similar to our human acceptance of spoken languages to facilitate communication.

The standard dictionary is a hierarchy of concept terms. Concept terms are related using “IS-A” relationships for modeling generalization and specialization and “HAS-A” relationships to construct component relationships.

Constructing Semantic Names

A semantic name captures system-independent semantics of a relational schema element by combining dictionary terms. In the relational model, a semantic name is a context if it is associated with a table and a concept if it is associated with a field. A context contains no data itself and is described using one or more concepts. A semantic name that is a concept represents atomic, lowest-level semantics. In relation to the object-oriented model, a context is like an object, and a concept is an attribute of an object.

A semantic name consists of a context portion and a concept portion. The context portion is one or more terms from the dictionary, which describe the context of the schema element. Adjacent context terms are related by either IS-A (represented using a “,”) or HAS-A (represented using a “;”) relationships.
The concept portion is a single dictionary term called a concept name and is only present if the semantic name is a concept (maps to a field). The formal specification of a semantic name (sname) is:

    sname ::= "[" CTerm "]" | "[" CTerm "]" CN
    CTerm ::= CT | CT ";" CTerm | CT "," CTerm

where CT and CN are dictionary terms.

The semantic names for Books-for-Less and Cheap Books are given in Figure 5 and Figure 6, respectively.

Type    Semantic Name            System Name
Table   [Book]                   Book
Field   [Book] ISBN              ISBN
Field   [Book] Title             Title
Field   [Book] Price             Price
Field   [Book] Quantity          Qty
Field   [Book;Author] Name       Author
Field   [Book;Publisher] Name    Publisher

Figure 5: Books-for-Less Semantic Names

Type    Semantic Name            System Name
Table   [Book]                   Book
Field   [Book] ISBN              ISBN
Field   [Book] Quantity          Quantity
Field   [Book] Title             Title
Field   [Book] Price             Price
Field   [Book] Description       Description
Field   [Book;Author] Id         Author id
Field   [Book;Publisher] Id      Publisher id
Table   [Book;Author]            Author
Field   [Book;Author] Id         Id
Field   [Book;Author] Name       Name
Table   [Book;Publisher]         Publisher
Field   [Book;Publisher] Id      Id
Field   [Book;Publisher] Name    Name

Figure 6: Cheap Books Semantic Names

X-Spec - A Metadata Specification Language

An X-Spec is an XML-based specification document which encodes relational database schema information using dictionary terms and metadata including keys, relationships, joins, and field semantics. Further, each table and field has a semantic name as previously discussed. Metadata on joins and dependencies is stored for query processing. An X-Spec is constructed using the specification editor component of Unity during the capture process.

The Integration Process – Forming the Ontology

The integration process combines the X-Specs retrieved from each data source into an integrated context view. The integration algorithm is a straightforward term matching algorithm. The same term in different X-Specs represents the identical concept regardless of its format. The algorithm receives as input one or more X-Specs and uses the semantic names present to match related concepts. The integration order is irrelevant, and the same X-Specs may be integrated several times with no change.
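
The term-matching idea can be sketched as follows. This is a simplified illustration, not Unity's implementation: each X-Spec is reduced to a list of (semantic name, system name) pairs taken from Figures 5 and 6, whereas a real X-Spec is an XML document that also carries keys and join metadata. The function name `integrate` and the source prefixes `BfL` and `CB` are assumptions made for this example.

```python
from collections import defaultdict

# Simplified stand-ins for two X-Specs: (semantic name, system name) pairs.
BFL = [
    ("[Book]", "Book"),
    ("[Book] ISBN", "Book.ISBN"),
    ("[Book] Title", "Book.Title"),
    ("[Book] Price", "Book.Price"),
    ("[Book] Quantity", "Book.Qty"),
    ("[Book;Author] Name", "Book.Author"),
    ("[Book;Publisher] Name", "Book.Publisher"),
]
CB = [
    ("[Book]", "Book"),
    ("[Book] ISBN", "Book.ISBN"),
    ("[Book] Title", "Book.Title"),
    ("[Book] Price", "Book.Price"),
    ("[Book] Quantity", "Book.Quantity"),
    ("[Book;Author] Id", "Book.Author id"),
    ("[Book;Author] Id", "Author.Id"),
    ("[Book;Author] Name", "Author.Name"),
]


def integrate(xspecs):
    """Group physical schema elements under identical semantic names.

    `xspecs` is an iterable of (source label, entries) pairs. Matching is
    by exact semantic name, and mappings are collected in sets, so neither
    the order of the sources nor repeated sources change the result.
    """
    view = defaultdict(set)
    for source, entries in xspecs:
        for semantic_name, system_name in entries:
            view[semantic_name].add(f"{source}.{system_name}")
    return dict(view)


view = integrate([("BfL", BFL), ("CB", CB)])
```

Because the mappings are held in sets keyed by semantic name, integrating the sources in the opposite order, or integrating Books-for-Less a second time, yields the same view.
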
As more X-Specs are integrated, the number of concepts grows, but assuming the semantic names are properly assigned, the effectiveness of the integration is unchanged. The integrated view produced for the book databases is given in Figure 7.

Global View Term    Data Source Mappings (not visible)
V (view r…          N/A
[Book]              CB.Book, BfL.Book
  ISBN              CB.Book.ISBN, BfL.Book.ISBN
  Title             CB.Book.Title, BfL.Book.Title
  Price             CB.Book.Price, …
  …                 …
  Description       ….Description
[Author]            CB.Author
  Id                CB.Book.Author id, CB.Author.Id
  Name              CB.Author.Name, BfL.Book.Author
[Publisher]         CB.Publisher
  Id                CB.Book.Publisher id, CB.Publisher.Id
  Name              CB.Publisher.Name, BfL.Book.Publisher

Figure 7: Integrated View
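
The semantic-name notation used in Figures 5 to 7 is regular enough to parse mechanically. The following is a minimal sketch, assuming the bracketed form given in the formal specification, with “;” read as HAS-A and “,” as IS-A. The class `SemanticName` and the function `parse_semantic_name` are illustrative names, not part of Unity.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SemanticName:
    context: Tuple[str, ...]    # context terms, outermost first
    relations: Tuple[str, ...]  # "HAS-A" or "IS-A" between adjacent terms
    concept: Optional[str]      # None when the name denotes a context (table)


# "[" context-terms "]" followed by an optional concept name.
_SNAME = re.compile(r"^\[([^\[\]]+)\]\s*(.*)$")


def parse_semantic_name(text: str) -> SemanticName:
    """Parse names such as "[Book;Author] Name" or "[Book]"."""
    match = _SNAME.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    context_part, concept_part = match.groups()
    # Splitting on a captured separator keeps the separators in the result:
    # "Book;Author" -> ["Book", ";", "Author"].
    pieces = re.split(r"([;,])", context_part)
    terms = tuple(piece.strip() for piece in pieces[0::2])
    if any(not term for term in terms):
        raise ValueError(f"empty context term in {text!r}")
    relations = tuple(
        "HAS-A" if separator == ";" else "IS-A" for separator in pieces[1::2]
    )
    concept = concept_part.strip() or None
    return SemanticName(terms, relations, concept)
```

For example, `parse_semantic_name("[Book;Author] Name")` yields the context terms `Book` and `Author` joined by one HAS-A relation with the concept `Name`, while `parse_semantic_name("[Book]")` has no concept and therefore denotes a context.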

The Query Process

The integrated view of concepts, called a context view, is a hierarchy of concepts and contexts, which map to physical tables and fields in the underlying databases. Businesses can query each other's repositories by formulating queries over semantic names. The querying business is not responsible for determining schema element mappings, joins between tables in a data source, or joins across data sources. The system inserts joins based on the relationships between schema elements.

The query processor in Unity:

- Determines the semantic names of concepts requested by the query, and for each data source, determines the best field mapping(s) for each semantic name and their associated tables.
- Given a set of fields and tables to access in a data source, determines which joins to insert to connect database tables.
- Generates the SQL queries determined in the previous steps, and transmits the SQL queries and authentication information to the wrapper systems for each data source.
- Retrieves row results from wrapper systems, applies reverse mappings back to semantic names, and displays formatted results to the query poser.
- Determines if row results should be unioned or joined together across databases based on the presence of common keys.

Query Example

Given the two bookstores described above, we now consider a typical e-business scenario. A third book retailer (FindAll Books) needs to locate as many copies of a book entitled “How to Query Databases” as possible. The first step is to integrate FindAll's business into the ontology described earlier.² This requires the creation of an X-Spec so the results, once located, can be returned. The integration algorithm must integrate FindAll's X-Spec into the integrated dictionary to form the ontology. FindAll can then submit a query based on the integrated schema so it can retrieve the necessary information from both stores.
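
The first and third query-processor steps listed above can be sketched for the simplest case, in which every requested semantic name maps to a field of a single table in each source. The mapping table and the function `to_sql` are hypothetical, with field names taken from Figures 5 and 6. Join insertion and result merging are deliberately left out, and the `?` placeholder style is an assumption about the database driver rather than something the paper specifies.

```python
# Hypothetical per-source mappings: semantic name -> (table, column).
MAPPINGS = {
    "Books-for-Less": {
        "[Book] Title": ("Book", "Title"),
        "[Book] Quantity": ("Book", "Qty"),
    },
    "Cheap Books": {
        "[Book] Title": ("Book", "Title"),
        "[Book] Quantity": ("Book", "Quantity"),
    },
}


def to_sql(source, select, where):
    """Translate a semantic query into parameterized SQL for one source.

    `select` is a list of semantic names to return; `where` maps semantic
    names to required values. Returns (sql_text, parameters).
    """
    fields = MAPPINGS[source]
    tables = {fields[name][0] for name in list(select) + list(where)}
    if len(tables) != 1:
        # A full query processor would insert joins here.
        raise NotImplementedError("join insertion is not sketched")
    (table,) = tables
    columns = ", ".join(fields[name][1] for name in select)
    conditions = " AND ".join(f"{fields[name][1]} = ?" for name in where)
    sql = f"SELECT {columns} FROM {table}"
    if conditions:
        sql += f" WHERE {conditions}"
    return sql, tuple(where.values())


query = (["[Book] Quantity"], {"[Book] Title": "How to Query Databases"})
per_source = {source: to_sql(source, *query) for source in MAPPINGS}
```

The same semantic query produces a different column name for each source (`Qty` for Books-for-Less, `Quantity` for Cheap Books), and passing the title as a parameter avoids quoting problems in the generated SQL text.
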
Authentication and security access codes for each business' database are required, but this is the responsibility of the Workflow Management component of Figure 1. For each database, this information is stored in Unity (the adapter) so information such as its website address and authentication information can be retrieved and transmitted automatically.

² This is not strictly required if the only business requirement is to retrieve data from the two bookstores. However, if business workflow is required between the three businesses, the workflow manager (see Figure 1) must understand the ontological model of all participants.

FindAll now selects the quantity available of the book entitled “How to Query Databases” using the integrated context view illustrated in Figure 7. Thus, two attributes are required from the integrated ontology:

    [Book] Title = “How to Query Databases”
    [Book] Quantity

The query processor now uses the mappings described in Figure 7 to produce the following SQL for each data source, which is sent to the adapters (recall Figure 1) for submission to each of the bookstore databases for processing:

    Cheap Books
        SELECT Quantity
        FROM Book
        WHERE Title = 'How to Query Databases';

    Books-for-Less
        SELECT Qty
        FROM Book
        WHERE Title = 'How to Query Databases';

Each wrapper then returns its results to Unity, which subsequently follows the directions of the workflow manager to return the integrated results to FindAll. Purchasing the book copies would require additional business logic that would need to be placed into the workflow manager, but FindAll does not need to do anything further for subsequent data exchange with Cheap Books or Books-for-Less. Clearly, this is an extremely powerful data interchange paradigm.

Related Work and Architecture Discussion

Mediator and wrapper systems such as Information Manifold [4] and TSIMMIS [5] answer queries across a wide range of data sources.
These systems construct integrated views using designer-based approaches, which are mapped using a query language or logical rules into views or queries on the data sources. Once an integrated view and corresponding mappings to source views are logically encoded, wrapper systems are systematically able to query and provide interoperability between data sources.

Internet and industrial standards organizations take a more pragmatic approach to integration by standardizing the definition, organization, and exchange mechanisms for data communications. Work on capturing metadata in industry has resulted in standardization efforts for exchanging data such as Electronic Data Interchange (EDI), Extensible Markup Language (XML [7]), and BizTalk [2]. Industrial systems achieve increased automation by accepting standards to resolve conflicts.

Unity combines standardization with algorithms for conflict resolution. By separating the specification of database semantics from the integration procedure, Unity implements automatic procedures to combine specifications and resolve conflicts. The combination of standardization with research algorithms to address the schema integration problem is unique.

The key benefit of the architecture is that the integration of data sources is automatic once the capture processes are completed. By their nature, capture processes are partially manual, as they require designers to capture semantic information in X-Specs using the X-Spec editor. Once a capture process for a data source is completed, it never has to be re-performed. Thus, the advantage of the architecture is that a global view is automatically created once designers independently define the local views of the individual data sources. Further, Unity preserves full autonomy of all data sources.

The major challenge inherent in the architecture is creating the standard dictionary. Although defining terms to represent concepts is challenging, it is not without precedent. Industrial standards such as XML and BizTalk rely on the acceptance of standard formats. Our architecture is even less restrictive, as names are standardized but structure is not.

Unity achieves automatic conflict resolution by using a standard dictionary to build semantic names, constructing a structurally-neutral integrated view from semantic names, and mapping semantic queries to SQL. The standard dictionary resolves the table naming conflict and the attribute naming conflict because contexts (tables) and concepts (attributes) will not be integrated unless they have the same semantics. Structural conflicts are resolved by mapping queries through the integrated view. Data-level conflicts are resolvable by defining functions, which convert between contexts, and by formally expressing context semantics.

Unity is not yet a complete work. The current implementation has shown utility in multidatabase and data warehouse environments, but these are predominantly characterized by being “read-only”. The B2B environment will require support for updates at multiple data sources. Although we believe support for updates in Unity should be a fairly easy extension for a single database, it is likely to prove quite challenging for an arbitrary business workflow that must atomically update multiple data sources. Thus, we are investigating transaction support for Unity in a B2B environment.

Bibliography

[1] M.W. Bright, A.R. Hurson, and S.H.
Pakzad, “A Taxonomy and Current Issues in Multidatabase Systems”, IEEE Computer, 25(3):50-60, March 1992.

[2] Microsoft Corporation, “BizTalk Framework 1.0 – Independent Document Specification”, Technical Report, Microsoft, November 1999.

[3] Uniform Code Council Inc., “SIL – Standard Interchange Language”, Technical Report, January 1999.

[4] T. Kirk, A. Levy, Y. Sagiv, and D. Srivastava, “The Information Manifold”, In AAAI Spring Symposium on Information Gathering, 1995.

[5] C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman, and M. Valiveti, “Capability Based Mediation in TSIMMIS”, In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 564-566, June 1998.

[6] A. Sheth and J. Larson, “Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases”, ACM Computing Surveys, 22(3):183-236, September 1990.

[7] W3C, “Extensible Markup Language (XML) 1.0”, Technical Report, February 1998.

Future Work and Conclusions

This paper has described Unity's ability to support the important environment commonly referred to as “B2B”. By utilizing Unity's philosophy of combining standardization and ad hoc schema integration, an extremely powerful paradigm is achieved. Thus, the user is able to create the key component of data exchange in the B2B environment, namely an ontology for data exchange. This ontology is based on the database semantics independently captured using the emerging XML language to exchange data between businesses. The paper has illustrated some of the power of X-Specs, which store semantic names for schema elements thereby identifying identicalc
