
Small or medium-scale focused research project (STREP)
ICT SME-DCA Call 2013
FP7-ICT-2013-SME-DCA

Data Publishing through the Cloud: A Data- and Platform-as-a-Service Approach to Efficient Open Data Publication and Consumption

DaPaaS

Deliverable D2.1: Open PaaS requirements, design & architecture specification

Date: 30.01.2014
Author(s): Brian Elvesæter, Dumitru Roman, Martin Fagereng Johansen, Arne J. Berre, Marin Dimitrov, and Alex Simov
Dissemination level: PU
WP: WP2
Version: 1.0

Copyright DaPaaS Consortium 2013-2015
Document metadata

Quality assurors and contributors
Quality assuror(s): Bill Roberts, Rick Moynihan, Amanda Smith
Contributor(s): DaPaaS Consortium

Version history
Version | Date | Description
0.1 | 04.12.2013 | Initial outline and Table of Contents (TOC).
0.2 | 09.12.2013 | Restructuring of deliverable.
 | | Draft section on requirements specification.
 | | Updated the requirements specification with description of key roles.
 | | Updated the technology review section and description of requirements.
 | | Consistency check and update of sections 1 (Introduction), 2 (Requirements Specification) and 3.1 (High-Level Architecture of the DaPaaS Platform).
 | | Updated the architecture description of the Platform Layer.
 | | Updated the review of relevant technologies.
 | | Finalized technology review. Deliverable ready for internal technical review.
0.95 | 29.01.2014 | Addressed comments by internal technical review. Deliverable ready for quality assurors.
1.0 | 30.01.2014 | Addressed comments by quality assurors. Final formatting and layout.
Executive Summary

The main goal of the DaPaaS project is to provide an integrated Data-as-a-Service (DaaS) and Platform-as-a-Service (PaaS) environment, together with associated services, for open data, where 3rd parties can publish and host both datasets and data-driven applications that are accessed by end user data consumers in a cross-platform manner.

This document provides:
- An overview of the DaPaaS Platform and the relevant roles played in the DaPaaS context;
- The requirements for the DaPaaS Platform;
- An initial architecture design for the Platform Layer of the DaPaaS Platform;
- A state-of-the-art overview of relevant solutions and technologies for the Platform Layer and some recommendations on reuse of existing solutions to be considered in the next phase – implementation of the first prototype.
Table of Contents

EXECUTIVE SUMMARY
TABLE OF CONTENTS
LIST OF ACRONYMS
LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
  1.1 DAPAAS OVERVIEW AND KEY ROLES
  1.2 STRUCTURE OF THIS REPORT
2 DAPAAS PLATFORM REQUIREMENTS SPECIFICATION
  2.1 INSTANCE OPERATOR
  2.2 DATA PUBLISHER
  2.3 APPLICATION DEVELOPER
  2.4 END USER DATA CONSUMER
3 ARCHITECTURE OVERVIEW
  3.1 HIGH-LEVEL ARCHITECTURE OF DAPAAS PLATFORM
  3.2 ARCHITECTURE OF THE PLATFORM LAYER
    3.2.1 User Management & Access Control
    3.2.2 Data Cleaning & App Development
    3.2.3 Notification
    3.2.4 App Management & Deployment
    3.2.5 Catalog
    3.2.6 Administration
  3.3 SUMMARY OF ADDRESSED REQUIREMENTS
4 REVIEW OF RELEVANT TECHNOLOGIES FOR THE PLATFORM LAYER
  4.1 TECHNOLOGY SELECTION APPROACH
  4.2 PAAS CAPABILITIES SOLUTIONS
    4.2.1 Docker
    4.2.2 Cocaine
    4.2.3 Deis
    4.2.4 Juju
    4.2.5 Cozy Cloud
    4.2.6 OpenCivic
    4.2.7 OpenStack
    4.2.8 Ansible
    4.2.9 Puppet Open Source
    4.2.10 Chef
    4.2.11 Nagios Core
  4.3 DATA INTEGRATION CAPABILITIES SOLUTIONS
    4.3.1 Talend Open Studio for Data Integration
    4.3.2 OpenRefine
    4.3.3 Karma
    4.3.4 Cascading
    4.3.5 Data Pipes
5 SUMMARY AND OUTLOOK
6 APPENDIX A: COMMERCIAL / CLOSED SOURCE INTEGRATED DAAS & PAAS SOLUTIONS
  6.1 DATAMEER
  6.2 SPLUNK
  6.3 WINDOWS AZURE MARKETPLACE
  6.4 GOODDATA
  6.5 TABLEAU SOFTWARE
  6.6 INFOCHIMPS
List of Acronyms

API: Application Programming Interface
CSV: Comma Separated Values (format)
DaaS: Data-as-a-Service
GUI: Graphical User Interface
HTTPS: Hypertext Transfer Protocol Secure
JSON: JavaScript Object Notation (format)
PaaS: Platform-as-a-Service
REST: Representational state transfer
RDF: Resource Description Framework
SLA: Service Level Agreement
SOA: Service Oriented Architecture
SPARQL: SPARQL Protocol and RDF Query Language
SSH: Secure Shell
UML: Unified Modeling Language
XML: eXtensible Markup Language
List of Figures

Figure 1: DaPaaS artefacts
Figure 2: Key roles in a typical DaPaaS context
Figure 3: Instance Operator (IO) requirements
Figure 4: Data Publisher (DP) requirements
Figure 5: Application Developer (AD) requirements
Figure 6: End User Data Consumer (EU) requirements
Figure 7: High-level architecture of the DaPaaS Platform
Figure 8: Architecture of the Platform Layer
Figure 9: Basic Docker functions
Figure 10: Cocaine architecture
Figure 11: Juju administration GUI
Figure 12: OpenStack conceptual architecture
Figure 13: How Puppet works
Figure 14: How Chef works
Figure 15: Operating principle of Nagios
Figure 16: Talend Open Studio for Data Integration
Figure 17: Edit cells in OpenRefine – Common transformations
Figure 18: Modelling data in Karma
Figure 19: Cascading architecture
List of Tables

Table 1: Description of requirements from Instance Operator (IO)
Table 2: Description of requirements from the Data Publisher (DP)
Table 3: Description of requirements from the Application Developer (AD)
Table 4: Description of requirements from the End User Data Consumer (EU)
Table 5: User Management & Access Control
Table 6: App Development
Table 7: Notification
Table 8: App Management & Deployment
Table 9: Catalog
Table 10: Administration
Table 11: Addressed requirements by components of the Platform Layer
Table 12: Overview of relevant open source technologies
1 Introduction

This report represents Deliverable D2.1 "Open PaaS requirements, design & architecture specification" of the DaPaaS project. This deliverable is a result of Task T2.1 "Requirements, analysis & design of the Open Platform-as-a-Service infrastructure".

The aim of this deliverable is two-fold:
1. To introduce the DaPaaS platform, the relevant roles played in the DaPaaS context, and their requirements towards a Data- and Platform-as-a-Service infrastructure for open data;
2. To provide details on the Platform Layer of the DaPaaS platform, with a focus on the architecture and evaluation of existing relevant technologies that could be reused for the implementation of the Platform Layer.

1.1 DaPaaS Overview and Key Roles

The main goal of the DaPaaS project is to provide an integrated Data-as-a-Service (DaaS) and Platform-as-a-Service (PaaS) environment for open data, where 3rd parties can publish and host both datasets and data-driven applications that are accessed by end user data consumers in a cross-platform manner.

The DaPaaS project will deliver the software that enables platform operators to deploy such an environment in the cloud. Figure 1 below illustrates the idea that the DaPaaS software (DaaS and PaaS functionalities) can have several deployed instances.

Figure 1: DaPaaS artefacts

As the main results of the DaPaaS project two major artefacts are expected:
1. Software consisting of DaaS, PaaS, and associated services;
2. One deployed instance of the Software in an XaaS manner. In the rest of this deliverable we will refer to this deployed instance as “DaPaaS Platform”.

The key roles involved in a typical DaPaaS context and their relationships to the main DaPaaS artefacts, the software and the platform, are illustrated in Figure 2 below. The roles are:
- The DaPaaS Developer implements DaPaaS software components and services for the integrated DaaS and PaaS environment. During the course of the project, this role is expected to be primarily played by the DaPaaS consortium.
- A deployed instance of DaPaaS software, i.e. the DaPaaS Platform, is operated and maintained by an Instance Operator. During the course of the project, this role is played by the DaPaaS consortium.
- The Data Publisher publishes data on the DaPaaS Platform which stores the data and makes it available for 3rd party application developers and end user data consumers.
- The Application Developer develops data-driven applications that use the data made available via the DaPaaS Platform. The applications are deployed and hosted in the DaPaaS Platform.
- End User Data Consumers consume data resulting from the deployed applications.

Figure 2: Key roles in a typical DaPaaS context

This document outlines the requirements for the DaPaaS platform from a role-based point of view with a focus on the services and functionality required by the Instance Operator, Data Publisher, Application Developer and End User Data Consumer.

1.2 Structure of this Report

The rest of this document is structured as follows:
- Section 2 describes the core requirements for the DaPaaS Platform from the perspective of the key roles introduced above;
- Based on the requirements identified in Section 2, Section 3 outlines the high-level architecture of the DaPaaS Platform, and details the Platform Layer in terms of core components and their relationships;
- Section 4 provides a review of relevant open source technologies for the implementation of the Platform Layer;
- Section 5 summarizes this document and provides technical recommendations for the implementation phase of the DaPaaS project; and
- Appendix A provides a brief summary of selected commercial/closed source solutions that provide capabilities relevant to the DaPaaS Platform.
2 DaPaaS Platform Requirements Specification

In the following subsections we use a UML use case-inspired notation and technique to describe the requirements of the DaPaaS Platform, e.g. capabilities or services that should be offered for the roles introduced above.

2.1 Instance Operator

The Instance Operator role is played by organizations that want to operate and maintain an instance of the DaPaaS Platform, e.g. acting as data brokers or creating data markets in various domains (e.g. environmental domain). Figure 3 below shows the requirements the Instance Operator poses on the DaPaaS Platform.

Figure 3: Instance Operator (IO) requirements

Descriptions of these requirements are given in Table 1 below.

Table 1: Description of requirements from Instance Operator (IO)
ID | Name | Brief description
IO-01 | Secure access to platform | The Instance Operator shall have secure access (e.g. HTTPS/SSH) to the platform.
IO-02 | Platform performance monitoring | The Instance Operator shall be able to monitor the performance (e.g. storage and memory usage, bandwidth, CPU usage, etc.).
IO-03 | Statistics monitoring (users, data, apps, usage) | The Instance Operator shall be able to retrieve statistics about users (e.g. number, profiles), data (e.g. number, size), apps and usage (e.g. dataset access, data consumption, number of service calls) as a basis for e.g. billing/invoicing for the usage of the platform.
IO-04 | User accounts management | The Instance Operator shall be able to manage user accounts (e.g. add, delete, assign roles).
IO-05 | Policy/quota configuration and enforcement | The Instance Operator shall be able to configure usage policies, e.g. data/apps quotas per user. The platform shall ensure enforcement of these policies, e.g. support deployment of applications subject to quotas and additional restrictions.
IO-06 | UI for Instance Operator | The Instance Operator shall be able to access the platform services through an appropriate user interface (graphical and/or console).

2.2 Data Publisher

The Data Publisher role is played by organizations that want to publish data via the DaPaaS Platform. Figure 4 below depicts the requirements the Data Publisher poses on the DaPaaS Platform.

Figure 4: Data Publisher (DP) requirements

Descriptions of these requirements are given in Table 2 below.
Table 2: Description of requirements from the Data Publisher (DP)
ID | Name | Brief description
DP-01 | Dataset import | The Data Publisher should have the ability to import open data into the DaPaaS platform. The data is not restricted to RDF / Linked Data and it may include other formats such as CSV, JSON, etc.
DP-02 | Data storage & querying | The Data Publisher should have access to APIs and query endpoints for accessing, querying and updating data stored on the platform.
DP-03 | Dataset search & exploration | The Data Publisher should have the possibility to explore the dataset catalog & select relevant datasets.
DP-04 | Data interlinking | The Data Publisher should have the possibility to semi-automatically interlink data from different datasets. This applies only to data which is already in RDF form.
DP-05 | Data cleaning & transformation | The Data Publisher should have the possibility to apply simple data cleanup & transformation (incl. RDFization) over legacy data.
DP-06 | Dataset bookmarking & notifications | The Data Publisher should have the possibility to subscribe to datasets and receive notifications on dataset changes.
DP-07 | Dataset metadata management, statistics & access policies | The Data Publisher should have the possibility to specify metadata, descriptions and access control policies for the datasets.
DP-08 | Data scalability | The platform should include mechanisms to scale to large data volumes.
DP-09 | Data availability | The platform should include mechanisms to provide high availability of data and limited downtime.
DP-10 | User registration & profile management | The Data Publisher shall be able to register as a data publisher and gain access to the relevant DaaS services.
DP-11 | Secure access to platform | The Data Publisher shall have secure access (e.g. HTTPS/SSH) to the platform.
DP-12 | UI for Data Publisher | The Data Publisher shall be able to access the DaaS services through appropriate user interfaces (graphical and/or console).
DP-13 | Data publishing methodology support | The data publication process should be accompanied by a tool-supported methodology outlining steps containing various data operations.

2.3 Application Developer

The Application Developer role is played by Open Data application developers that for various reasons (e.g. transparency, new business models, new services) want to develop new applications and services around data, and want to do so as quickly and easily as possible. Figure 5 below depicts the requirements the Application Developer poses on the DaPaaS Platform.
Figure 5: Application Developer (AD) requirements

Descriptions of these requirements are given in Table 3 below.

Table 3: Description of requirements from the Application Developer (AD)
ID | Name | Brief description
AD-01 | Access to Data Publisher services (DP-01 – DP-13) | The Application Developer shall have access to APIs and libraries to access, import, transform, store, query, etc., datasets to be used in the development of applications. In essence, the Application Developer has requirements similar to those outlined in DP-01 – DP-13, including the requirements for secure access to the platform and profile management.
AD-02 | Data export | The Application Developer shall have the possibility to export data in various formats.
AD-03 | Develop applications in state-of-art programming languages | The Application Developer shall have the possibility to develop applications in the common state-of-art programming languages, e.g. Java, Scala, Go, Ruby.
AD-04 | Configure application deployment | The Application Developer shall have the possibility to configure use of common cloud resources, e.g. database/storage, possibly also graphical widgets.
AD-05 | Deploy and monitor application | The Application Developer shall have access to a controlled application hosting environment where data-intensive applications can be easily deployed, as well as monitoring facilities for the deployed applications.
AD-06 | Application metadata management, statistics & access policies | The Application Developer shall have the possibility to update metadata about applications (e.g. description) and retrieve statistics about the usage of the application.
AD-07 | UI for Application Developer | The Application Developer shall have the possibility to access the relevant DaaS and PaaS services through appropriate user interfaces (graphical and/or console).
AD-08 | Application development methodology support | Application Developers should have access to a tool-supported methodology outlining steps for developing and deploying data-intensive applications.

2.4 End User Data Consumer

The End User Data Consumer role is played by organizations or individuals that want to consume data and applications deployed on the platform. Figure 6 below shows the requirements the End User Data Consumer poses on the DaPaaS Platform.

Figure 6: End User Data Consumer (EU) requirements

Descriptions of these requirements are given in Table 4 below.

Table 4: Description of requirements from the End User Data Consumer (EU)
ID | Name | Brief description
EU-01 | User registration & profile management | End Users shall be able to register as application consumers and manage their profiles.
EU-02 | Search & explore datasets and applications | End Users shall be able to search and explore datasets and applications available in the platform.
EU-03 | Datasets and applications bookmarking and notifications | End Users shall be able to bookmark and receive notifications (e.g. updates) of datasets and applications to which they subscribe.
EU-04 | Mobile and desktop GUI access | End Users shall be able to access applications on both mobile and desktop devices, which requires UX components to support both mobile and desktop users in an appropriate manner. The End User Data Consumers shall be able to access the relevant platform services, e.g., search for datasets, applications, run applications, visualize datasets, etc., through appropriate graphical user interfaces (GUIs), e.g. pie charts, time series and maps.
EU-05 | Data export and download | End Users shall have the possibility to export data in various formats and download data from the platform.
EU-06 | High availability of data and applications | High availability of data and apps.
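To make requirements DP-01 (import of non-RDF formats such as CSV) and DP-05 (simple cleanup and RDFization of legacy data) more concrete, the following is a minimal sketch in Python, not the DaPaaS implementation. The base URI, the function names and the choice of N-Triples as output syntax are assumptions made for illustration only; the deliverable does not prescribe any of them.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI; the deliverable does not define a URI scheme.
BASE_URI = "http://example.org/dapaas/"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text: str, key_column: str) -> list[str]:
    """Clean each CSV row and emit one triple per non-empty, non-key cell."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The key column identifies the row and becomes the subject IRI.
        subject = f"<{BASE_URI}row/{quote(row[key_column].strip(), safe='')}>"
        for column, value in row.items():
            if column is None or column == key_column:
                continue  # skip surplus cells and the key itself
            cleaned = (value or "").strip()  # cleanup step: trim whitespace
            if not cleaned:
                continue  # cleanup step: drop empty cells
            predicate = f"<{BASE_URI}property/{quote(column.strip(), safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(cleaned)}" .')
    return triples


if __name__ == "__main__":
    sample = "id,name,city\n1, Oslo Station ,Oslo\n2,Bergen,\n"
    for triple in csv_to_ntriples(sample, "id"):
        print(triple)
```

The sketch treats every value as a plain string literal. A real data cleaning and transformation component would also need datatype detection, vocabulary mapping and interlinking (DP-04), which are outside the scope of this example.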
3 Architecture Overview

This section outlines the high-level architecture of the DaPaaS Platform (Section 3.1), and details the Platform Layer in terms of core components and their interactions (Section 3.2).

3.1 High-Level Architecture of DaPaaS Platform

[Figure: high-level architecture diagram. Labels: UX Layer (UX Services); Platform Layer (Application Hosting Environment); Data Layer (DaaS Services, Datasets, Open Data Warehouse); Security & Access Control; Usage Monitoring; Tool-supported Methodology for Data Publishing/Consumption]

The requirements outlined in the previous section imply a layered architecture consisting of a Data-as-a-Service layer (Data Layer) for scalable data hosting, a Platform-as-a-Service layer (Platform Layer) for application development and hosting, and a User Experience Layer (UX Layer) for user-friendly access to data and applications. These three core layers cross-cut vertical layers that are related to methodology support for data publishing and application development,