The Roles of ETL, ESB, and Data Virtualization Technologies in Integration Landscape

Chapter 1: Data Integration
Chapter 2: Extract, Transform, Load - ETL
Chapter 3: Enterprise Service Bus - ESB
Chapter 4: Data Virtualization - DV
Chapter 5: Data Integration Strategies Compared
Chapter 6: Integrating Data Integration Strategies
Chapter 7: Case Studies

Chapter 1: Data Integration

Data Silo Syndrome

Data silos are data sources that cannot easily share data with one another. They have plagued the IT landscape for many years and continue to do so today, despite the advent of broadband Internet, gigabit networking, and cloud-based storage.

Data silos exist for a variety of reasons:
- Old systems have trouble talking with modern systems.
- On-premises systems have difficulty talking with cloud-based systems.
- Some systems only work with specific applications.
- Some systems are configured to be accessed only by specific individuals or groups.
- Companies acquire other companies, taking on differently configured systems.

Bringing the Data Together

The problem with data silos is that no one can run a query across them; they must be queried separately, and the separate results need to be added together manually, which is costly, time-consuming, and inefficient. To bring the data together, companies use one or more of the following data integration strategies:

1. Extract, Transform, and Load (ETL) Processes, which copy the data from the silos and move it to a central location, usually a data warehouse
2. Enterprise Service Buses (ESBs), which establish a communication system for applications, enabling them to share information
3. Data Virtualization, which creates real-time, integrated views of the data in data silos, and makes them available to applications and analysts

Let’s take a look at each of these in turn.

Chapter 2: Extract, Transform, Load - ETL

ETL Processes Explained

ETL processes were the first data integration strategies, introduced as early as the 1970s.

1. First, the data is extracted from the source.
2. Next, the extracted copy of the data is transformed into the format and structure required by its final destination.
3. Finally, the transformed copy of the data is loaded into its final destination, be it an operational data store, a data mart, or a data warehouse.

Some processes do the transformation in the final step, and are therefore called “ELT processes,” but the basic concept is the same.
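The three steps can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the CSV text stands in for a source system, an in-memory SQLite database stands in for the data warehouse, and the table name, column names, and function names are hypothetical rather than taken from any ETL product.

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export from an operational system.
SOURCE_CSV = """order_id,customer,amount_usd
1,acme,19.99
2,globex,5.00
3,acme,42.50
"""


def extract(raw_csv):
    """Extract: read a copy of the rows from the source."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Transform: reshape the rows into the structure the destination expects."""
    return [
        (
            int(row["order_id"]),
            row["customer"].upper(),                 # normalize customer names
            round(float(row["amount_usd"]) * 100),   # store amounts as integer cents
        )
        for row in rows
    ]


def load(conn, records):
    """Load: write the transformed copy into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(warehouse, transform(extract(SOURCE_CSV)))

# Once loaded, the data can be queried in one place.
totals = warehouse.execute(
    "SELECT customer, SUM(amount_cents) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
```

An ELT variant would load the raw rows first and run the transformation inside the destination, for example as SQL over a staging table.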

Pros and Cons of ETL Processes

Pros:
- ETL processes are efficient and effective at moving data in bulk.
- The technology is well understood and supported by established vendors.
- ETL tools have features that sufficiently support bulk/batch data movement.
- Most organizations have ETL competencies in-house.

Cons:
- Moving data is not always the best approach, as this results in a new repository that needs to be maintained.
- Large organizations can have thousands of ETL processes running each night, synchronized by scripts that are difficult to modify if needed.
- Typically, ETL processes are not collaborative; the end user needs to wait until the data is ready.
- ETL processes cannot handle today’s data volumes and complex data types.

Chapter 3: Enterprise Service Bus - ESB

ESBs Explained

ESBs, introduced in 2002, use a message bus to exchange information between applications. With a communication bus sitting between the applications, they can talk to each other by talking to the bus. This decouples systems, and allows them to communicate without depending on, or even knowing about, other systems on the bus. This forms the underpinnings of service-oriented architecture (SOA), in which applications can easily share services across an organization. ESBs were born out of the need to move away from point-to-point integration, which, like ETL scripts, is hard to maintain over time.

[Figure: An enterprise service bus connecting a J2EE application, packaged applications and legacy systems, a database, a partner system, and a web service.]
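The decoupling described above can be sketched with a toy in-process message bus. This is only an illustration of the idea, not how any ESB product is implemented; the `MessageBus` class, the topic name, and the two inboxes are hypothetical.

```python
from collections import defaultdict


class MessageBus:
    """A toy bus: applications talk to the bus, never directly to each other."""

    def __init__(self):
        # topic name -> list of handler callables
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a handler to be called for every message on a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver a message to every handler subscribed to the topic."""
        for handler in self._subscribers[topic]:
            handler(message)


bus = MessageBus()

# Two hypothetical consuming applications, each with its own inbox.
billing_inbox = []
shipping_inbox = []
bus.subscribe("order.created", billing_inbox.append)
bus.subscribe("order.created", shipping_inbox.append)

# The publishing application knows only the bus and the topic name.
bus.publish("order.created", {"order_id": 1, "customer": "acme"})
```

Adding a third consumer requires one more `subscribe` call and no change to the publisher, which is the property that point-to-point integration lacks.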

The Pros and Cons of ESBs

Pros:
- Applications are decoupled.
- ESBs can be used to orchestrate business logic using message flows.
- ESB technology is mature, and is provided by established vendors.
- ESBs can address operational scenarios by using messages to trigger events.

Cons:
- ESBs cannot integrate application data to deliver on analytical use cases.
- Queries are static and can only be scheduled; ESBs do not easily support ad-hoc queries.
- Database queries are restricted to one source at a time; joins and other multiple-source functions are performed in memory, which drains resources.
- ESBs are only suitable for operational use cases, which involve small result sets.

Chapter 4: Data Virtualization - DV

Data Virtualization Explained

Data virtualization creates integrated views of data drawn from disparate sources, locations, and formats, without replicating the data, and delivers these views, in real time, to multiple applications and users. Data virtualization can draw from a wide variety of structured, semi-structured, and unstructured sources, and can deliver to a wide variety of consumers.

Since no replication is involved, the data virtualization layer contains no source data; it only contains the metadata required to access each of the applicable sources, as well as any global instructions that organizations may wish to implement, such as security or governance controls.

Users and applications query the data virtualization layer, which in turn gets the data from the various sources. The data virtualization layer shields users and applications from the complexities of access, and to all consumers, the data virtualization layer appears as a single, unified repository.

[Figure: The data virtualization layer works in three steps: (1) Connect: connects to disparate sources; (2) Combine: combines related data into views; (3) Publish: publishes the data to applications. Data consumers (analytical and operational): enterprise applications, reporting, BI, portals, ESB, mobile, web, users. Disparate data sources (from more structured to less structured): databases and warehouses, cloud/SaaS applications, big data, NoSQL, web, XML, Excel, PDF, Word. Other labels: SQL, MDX; web services; big data APIs; web automation and indexing; query, search, browse; request/reply, event driven; secure delivery; multiple protocols and formats.]
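As a rough sketch of this idea, the toy layer below stores only a mapping from source names to fetch functions (its "metadata") and builds an integrated view at request time. It is an assumption-laden illustration, not the Denodo Platform's API: the `VirtualLayer` class, the source names, and the sample data are all hypothetical, and query optimization, security, and caching are ignored.

```python
import csv
import io
import sqlite3

# Source 1 (hypothetical): a relational database of customers.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2 (hypothetical): a flat file of orders.
ORDERS_CSV = "customer_id,amount\n1,20\n1,30\n2,5\n"


class VirtualLayer:
    """Holds access metadata only; source data is fetched on every request."""

    def __init__(self):
        self._sources = {}  # source name -> zero-argument fetch function

    def register(self, name, fetch):
        self._sources[name] = fetch

    def query(self, name):
        return self._sources[name]()

    def customer_totals(self):
        """Integrated view: customers joined with their order totals."""
        totals = {}
        for order in self.query("orders"):
            cid = int(order["customer_id"])
            totals[cid] = totals.get(cid, 0) + int(order["amount"])
        return [
            {"name": name, "total": totals.get(cid, 0)}
            for cid, name in self.query("customers")
        ]


layer = VirtualLayer()
layer.register(
    "customers",
    lambda: crm.execute("SELECT id, name FROM customers ORDER BY id").fetchall(),
)
layer.register("orders", lambda: list(csv.DictReader(io.StringIO(ORDERS_CSV))))

view = layer.customer_totals()
```

Because nothing is copied into the layer, a row added to the `customers` table shows up the next time `customer_totals()` is called, with no reload step.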

The Pros and Cons of Data Virtualization

Pros:
- Seamlessly federates two or more disparate data sources (makes them appear and function as one), including a mix of structured and unstructured sources.
- Adds value-added features such as intelligent real-time query optimization, caching, in-memory processing, and custom optimization strategies based on source constraints, application needs, or network awareness.
- Via an API, any primary, derived, integrated, or virtual data source can be made accessible in a different format or protocol than the original, with controlled access, in minutes.
- All data is accessible through a single virtual layer, which quickly exposes redundancy, consistency, and data quality issues, and enables the application of universal, end-to-end governance and security controls.

Cons:
- Lacks support for bulk/batch data movement, which might be required by a few use cases.

Chapter 5: Data Integration Strategies Compared

Below is a summary of several data integration use cases, indicating which strategies are best suited to each task. In the next chapter, we’ll discuss how two strategies can work together in support of various use cases.

[Table: suitability of DV*, ETL**, and ESB*** for each use case listed below. The per-strategy markings are not reproduced here.]

Use cases:
- Moving data into EDW or ODS
- Migrating EDW (to Cloud)
- Data Unification
- Customer 720°
- Real-time insights
- Virtual Data Marts
- Physical Data Marts
- Agile Reporting (from EDW and other sources)
- Logical Data Warehouse
- Data Warehouse Offloading
- Application Synchronization
- Metadata Discovery and Enrichment
- Self-Service Analytics
- ETL “seeding” (decouple ETL from sources)
- Event-Driven Workflows

*DV: Data Virtualization.
**ETL: Extract, Transform, Load.
***ESB: Enterprise Service Bus.

Chapter 6: Integrating Data Integration Strategies

Of the three strategies, data virtualization is the most adaptable in working with other strategies, since it supports such a wide variety of sources and targets. Let’s take a look at how data virtualization works with ETL processes and ESBs.

Data Virtualization and ETL Processes. ETL processes were designed for moving data into data warehouses and similar environments, and they are particularly well suited to this task. But ETL processes cannot easily support cloud-based sources. Data virtualization can complement ETL processes in the following ways:
- Seamlessly connecting on-premises with cloud data sources without the need to consolidate data in a single repository.
- Enabling the migration from on-premises to cloud-based systems without interrupting business continuity.
- Data warehouse offloading, in which data virtualization not only helps with the offloading process, but also unifies data across the traditional data warehouse and the new repository, such as Hadoop, AWS S3, or a cloud-based data store.
- Real-time integration of disparate data sources.
- Replacing ETL processes with data virtualization where faster access to data is necessary.

Data Virtualization and ESBs. Data virtualization can complement an ESB and enhance its performance. Adding new sources to an ESB can be complex; sources like relational databases, Web or cloud-based sources, flat files, or email messages are not immediately enabled for the service-oriented architecture (SOA) that the ESB supports. To streamline this process, all sources that the ESB cannot handle can be unified by the data virtualization layer before being passed to the ESB. This architecture exploits the best qualities of both technologies: data virtualization unifies disparate sources, and ESBs deliver the critical messages to support the business process.
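The ESB arrangement described above, in which sources are unified first and then handed to the bus, might be sketched as follows. The sketch is hypothetical throughout: a flat file and a list of email subject lines stand in for sources the ESB cannot consume directly, `unified_customer_view` stands in for a data virtualization view, and the `publish` callback stands in for whatever publishing interface a real ESB exposes.

```python
import csv
import io


def unified_customer_view(flat_file_text, email_subjects):
    """Stand-in for a data virtualization view over two non-SOA sources."""
    view = []
    for row in csv.DictReader(io.StringIO(flat_file_text)):
        name = row["customer"]
        # Count email subjects that mention the customer, ignoring case.
        mentions = sum(1 for s in email_subjects if name.lower() in s.lower())
        view.append(
            {"customer": name, "region": row["region"], "email_mentions": mentions}
        )
    return view


def forward_to_bus(view, publish):
    """Hand each unified record to the bus as one message; return the count."""
    for record in view:
        publish("customer.updated", record)
    return len(view)


# Hypothetical sample sources.
FLAT_FILE = "customer,region\nAcme,EMEA\nGlobex,APAC\n"
EMAILS = ["Acme invoice question", "Re: ACME renewal", "Globex outage"]

sent = []  # records what the stand-in bus received
count = forward_to_bus(
    unified_customer_view(FLAT_FILE, EMAILS),
    lambda topic, message: sent.append((topic, message)),
)
```

The bus only ever sees uniform messages on one topic; the work of reading and reconciling the underlying formats stays in the unifying layer.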

Chapter 7: Case Studies

Logitech: Achieving a Successful Cloud Migration by Complementing an ETL System with Data Virtualization

Logitech is a Swiss global provider of personal computer and tablet accessories. For many years, the company had been developing and delivering data services for analytics using on-premises systems, integrated via ETL processes.

But provisioning data services for business users was proving to be reactive, time-consuming, and inefficient. To overcome these limitations, Logitech moved IT operations to the cloud. However, some data sources remained on-premises, so Logitech needed a solution that could seamlessly integrate all of its on-premises, ETL, and cloud components.

Logitech leveraged the Denodo Platform, hosted on Amazon Web Services (AWS), to establish a data virtualization layer that integrates these sources. After creating a single, consistent data store, the Denodo Platform feeds analytics and reporting applications such as Tableau, Pentaho BA, and web services. In the Logitech infrastructure, the Denodo Platform has become the single source of truth, feeding the entire consumption layer.

[Figure: Logitech architecture, showing consuming applications and platforms, Amazon Web Services components, and internal and external data sources. Labels include Tableau, Pentaho BA, OBIEE, Cubes, NLP Engine, Alexa Platform, Data Services, Excel, POS, MDM, AWS Spark, ERP, GitHub, DRM, SFDC, AWS Redshift, AWS S3, AWS EMR, AWS RDS, JSON, and AWS Glacier.]

Leading National Life Insurance Firm: Enhancing an ESB with Data Virtualization

A leading national life insurance firm was integrating data from a variety of heterogeneous sources via an ESB, but stakeholders could not easily change input parameters, which added complexity and latency to the company’s Enterprise Data Marketplace, an in-house data mart.

The firm deployed the Denodo Platform, which established a virtual data layer that the Data Marketplace UI can access via a web service. The data virtualization layer unifies the data from the heterogeneous sources before passing it to the ESB, in full support of the company’s existing workflows, while also enabling stakeholders to dynamically change query parameters and other functions.

[Figure: Architecture in four tiers. Business solutions (access, information-as-a-service): BI, CPM and reporting; portal and dashboards; applications; Enterprise Data Marketplace. Enterprise data service registry (standard metadata and enterprise data services): metadata, scheduling and delivery, usage stats. Virtual data layer (data virtualization, an abstract layer for data services): virtual data marts, reusable data services, virtual operational data stores. Disparate data (any source, any format): files, big data, NoSQL, packaged apps, RDBMS, web services.]

Digital Realty: Replacing ETL Processes with Data Virtualization

Digital Realty is a provider of data center acquisition, ownership, development, and operations, as well as of colocation services. For data integration, the company was making extensive use of ETL processes, but felt that these systems were negatively impacting the efficiency and speed with which business users could access information.

Digital Realty replaced the majority of its ETL processes, including ETL processes for MDM, with a single data virtualization layer enabled by the Denodo Platform. In addition, the data virtualization layer seamlessly aggregates a broad and diverse set of disparate sources to feed Digital Realty’s Birst-based dashboards, enabling executives to create financial and operational reports much more easily and quickly.

The data virtualization layer improved data speed-to-delivery fivefold, and enabled Digital Realty to reduce costly ETL processes by more than 90%. By passing all data through a unifying layer, Digital Realty was also able to implement robust governance protocols, with granular control over data lineage.

[Figure: Digital Realty architecture in three tiers: analysis and reporting; governed data layer; business systems (Provisioning, CRM, Finance, Portals, Facility Management, Service Deck, HR, Payroll).]
