Cloud Based Big Data Infrastructure: Architectural Components and Automated Provisioning
The 2016 International Conference on High Performance Computing & Simulation (HPCS2016)
18-22 July 2016, Innsbruck, Austria
Yuri Demchenko, University of Amsterdam
HPCS2016, Cloud based Big Data Infrastructure

Outline
• Big Data and new concepts
  – Big Data definition and Big Data Architecture Framework (BDAF)
  – Data driven vs data intensive vs data centric and Data Science
• Cloud Computing as a platform of choice for Big Data applications
  – Big Data Stack and Cloud platforms for Big Data
  – Big Data Infrastructure provisioning automation
• CYCLONE project and use cases for cloud based scientific applications automation
• SlipStream and cloud automation tools
  – SlipStream recipe example
• Discussion
The research leading to these results has received funding from the Horizon2020 project CYCLONE.

Big Data definition revisited: 6 V's of Big Data
[Figure: 6 Vs of Big Data, with example attributes per property]
• Volume: Terabytes, Records/Arch, Tables, Files, Distributed
• Variety: Linked, Dynamic
• Velocity: Batch, Real/near-time, Processes, Streams
• Variability: Changing data, Changing model, Linkage
• Veracity: Trustworthiness, Authenticity, Origin, Reputation, Availability, Accountability
• Value
Commonly accepted 3V's of Big Data: Volume, Velocity, Variety
Generic Big Data Properties: Volume, Variety, Velocity
Acquired Properties (after entering system): Value, Veracity, Variability
Adopted in general by NIST BD-WG

Big Data definition: From 6V to 5 Components
(1) Big Data Properties: 6V
  – Volume, Variety, Velocity
  – Value, Veracity, Variability
(2) New Data Models
  – Data linking, provenance and referential integrity
  – Data Lifecycle and Variability/Evolution
(3) New Analytics
  – Real-time/streaming analytics, machine learning and iterative analytics
(4) New Infrastructure and Tools
  – High performance Computing, Storage, Network
  – Heterogeneous multi-provider services integration
  – New Data Centric (multi-stakeholder) service models
  – New Data Centric security models for trusted infrastructure and data processing and storage
(5) Source and Target
  – High velocity/speed data capture from a variety of sensors and data sources
  – Data delivery to different visualisation and actionable systems and consumers
  – Fully digitised input and output, (ubiquitous) sensor networks, full digital control

Moving to Data-Centric Models and Technologies
• Current IT and communication technologies are host based or host centric
  – Any communication or processing is bound to the host/computer that runs the software
  – Especially in security: all security models are host/client based
• Big Data requires new data-centric models
  – Data location, search, access
  – Data integrity and identification
  – Data lifecycle and variability
  – Data centric (declarative) programming models
  – Data aware infrastructure to support new data formats and data centric programming models
• Data centric security and access control

Moving to Data-Centric Models and Technologies (continued)
• RDA developments (2016): data become infrastructure themselves
  – PID, Metadata Registries, Data models and formats, Data Factories

NIST Big Data Working Group (NBD-WG) and ISO/IEC JTC1 Study Group on Big Data (SGBD)
• NIST Big Data Working Group (NBD-WG) is leading the development of the Big Data Technology Roadmap
  – Built on experience of developing the Cloud Computing standards fully accepted by industry
• Set of documents published in September 2015 as NIST Special Publication NIST SP 1500: NIST Big Data Interoperability Framework (lications/NIST.SP.1500-1.pdf)
  – Volume 1: NIST Big Data Definitions
  – Volume 2: NIST Big Data Taxonomies
  – Volume 3: NIST Big Data Use Case & Requirements
  – Volume 4: NIST Big Data Security and Privacy Requirements
  – Volume 5: NIST Big Data Architectures White Paper Survey
  – Volume 6: NIST Big Data Reference Architecture
  – Volume 7: NIST Big Data Technology Roadmap
• NBD-WG defined 3 main components of the new technology:
  – Big Data Paradigm
  – Big Data Science and Data Scientist as a new profession
  – Big Data Architecture
• The Big Data Paradigm consists of the distribution of data systems across horizontally-coupled independent resources to achieve the scalability needed for the efficient processing of extensive datasets.

NIST Big Data Reference Architecture
[Figure: NIST Big Data Reference Architecture, arranged along the Information Value Chain and the IT Value Chain, with data flows, service use, and software tools and algorithms transfer between components; Security & Privacy and Management span the components.]
Main components of the Big Data ecosystem
• Data Provider
• Big Data Applications Provider
• Big Data Framework Provider
• Data Consumer
• Service Orchestrator
Big Data Lifecycle and Applications Provider activities
• Collection
• Preparation
• Analysis and Analytics
• Visualization
• Access
Big Data Framework Provider layers
• Processing Frameworks (analytic tools, etc.): horizontally scalable, vertically scalable
• Platforms (databases, etc.): horizontally scalable, vertically scalable
• Infrastructures: horizontally scalable (VM clusters), vertically scalable
• Physical and Virtual Resources (networking, computing, etc.)
The Big Data Ecosystem includes all components that are involved in Big Data production, processing, delivery, and consumption.
[ref] Volume 6: NIST Big Data Reference Architecture. output docs.php

Big Data Architecture Framework (BDAF) by UvA
(1) Data Models, Structures, Types
  – Data formats, non/relational, file systems, etc.
(2) Big Data Management
  – Big Data Lifecycle (Management) Model
  – Big Data transformation/staging
  – Provenance, Curation, Archiving
(3) Big Data Analytics and Tools
  – Big Data Applications
  – Target use, presentation, visualisation
(4) Big Data Infrastructure (BDI)
  – Storage, Compute, (High Performance Computing,) Network
  – Sensor network, target/actionable devices
  – Big Data Operational support
(5) Big Data Security
  – Data security at rest and in motion, trusted processing environments

Big Data Ecosystem: General BD Infrastructure
[Figure: data flow from Big Data Source/Origin (sensor, experiment, log data, behavioral data) through Data Transformation and Data Management to Data Delivery and Visualisation for the Consumer, ending at the Big Data Target/Customer (actionable/usable data; target users, processes, objects, behavior). Data categories: metadata, (un)structured, (non)identifiable. Supporting layers: High Performance Computer Clusters, data management and storage, Intercloud multi-provider heterogeneous infrastructure, security infrastructure, network infrastructure.]
Big Data Infrastructure
• Heterogeneous multi-provider inter-cloud infrastructure
• Data management infrastructure
• Collaborative Environment (user/groups management)
• Advanced high performance (programmable) network
• Security infrastructure

Big Data Infrastructure and Analytics Tools
Big Data Infrastructure
• Heterogeneous multi-provider inter-cloud infrastructure
• Data management infrastructure
• Collaborative Environment
• Advanced high performance (programmable) network
• Security infrastructure
• Federated Access and Delivery Infrastructure (FADI)
Big Data Analytics Infrastructure/Tools
• High Performance Computer Clusters (HPCC)
• Big Data storage and databases: SQL and NoSQL
• Analytics/processing: Real-time, Interactive, Batch, Streaming
• Big Data Analytics tools and applications

Data Lifecycle/Transformation Model
[Figure: data passes from the Big Data Source through Data Storage and successive Data Models (1), (3), (4); later stages include data repurposing, analytics re-factoring, and secondary processing.]
• Multiple Data Models and structures
  – Data Variety and Variability
  – Semantic Interoperability
• Data (inter)linking
  – PID/OID
  – Identification
  – Privacy, Opacity
  – Traceability vs Opacity
• The Data Model changes along the data lifecycle or evolution
• Data provenance is a discipline to track all data transformations along the lifecycle
• Identifying and linking data
  – Persistent data/object identifiers (PID/OID)
  – Traceability vs Opacity
  – Referential integrity
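The provenance idea on this slide (tracking every transformation along the lifecycle) can be illustrated with a minimal shell sketch. It is not taken from the talk: the file names, the `record_step` helper, and the tab-separated log layout are all assumptions made for illustration, and the POSIX `cksum` checksum stands in for a real persistent identifier.

```shell
#!/usr/bin/env bash
# Illustrative provenance log: one line per transformation,
# holding the step name, the input checksum, and the output checksum.
WORK="$(mktemp -d)"
PROV_LOG="$WORK/provenance.tsv"

# Checksum of a file's content (stand-in for a persistent identifier).
checksum() { cksum < "$1" | cut -d' ' -f1; }

# record_step NAME INPUT OUTPUT: append one provenance record.
record_step() {
  printf '%s\t%s\t%s\n' "$1" "$(checksum "$2")" "$(checksum "$3")" >> "$PROV_LOG"
}

# Step 1: remove duplicate records from the raw data.
printf 'b\na\nb\n' > "$WORK/raw.txt"
sort -u "$WORK/raw.txt" > "$WORK/clean.txt"
record_step deduplicate "$WORK/raw.txt" "$WORK/clean.txt"

# Step 2: derive a record count from the cleaned data.
wc -l < "$WORK/clean.txt" | tr -d ' ' > "$WORK/count.txt"
record_step count "$WORK/clean.txt" "$WORK/count.txt"

cat "$PROV_LOG"
```

Because the output checksum of one step equals the input checksum of the next, the log can be walked backwards from a derived dataset to its source, which is the traceability property the slide refers to.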

Cloud Based Big Data Services
[Figure: Data Sources → Data Ingest (data servers) → Data Analysis (compute servers) → Data Visualisation (query servers), running on a high-performance and scalable computing system (e.g. Hadoop) over a general purpose cloud infrastructure. The source dataset and the derived datasets d1, d2, d3 are held in a parallel file system (e.g., GFS, HDFS).]
• Characteristics: massive data and computation on cloud, small queries
• Examples: search, scene completion service, log processing

Big Data Stack components and technologies
The major structural components of the Big Data stack are grouped around the main stages of data transformation:
• Data ingest: ingestion will transform, normalize, distribute and integrate data to one or more of the Analytic or Decision Support engines. Ingest can be done via an ingest API or by connecting to existing queues, which handle partitioning, replication, prioritisation and ordering of data.
• Data processing: use one or more analytics or decision support engines to accomplish a specific task in the data processing workflow, using batch data processing, streaming analytics, or real-time decision support.
• Data export: export will transform, normalize, distribute and integrate output data to one or more Data Warehouse or Storage platforms.
• Back-end data management, reporting, visualization: supports data storage and historical analysis. OLAP platforms/engines support data acquisition and further use for Business Intelligence and historical analysis.
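The ingest, processing, and export stages can be sketched as a chain of shell filters. This is a toy illustration under stated assumptions, not one of the platforms named in the talk: the `sensor,value` record format and the function names `ingest`, `process`, `export_rows`, and `run_pipeline` are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: each stack stage is a filter, and the stages are chained with pipes.

# Ingest: normalise raw "sensor,value" records (strip spaces, lower-case the key).
ingest() {
  tr -d ' ' | tr '[:upper:]' '[:lower:]'
}

# Process: batch-style aggregation that sums the values per sensor.
process() {
  awk -F, '{ sum[$1] += $2 } END { for (k in sum) print k "," sum[k] }' | sort
}

# Export: reshape the results into tab-separated rows for a downstream store.
export_rows() {
  tr ',' '\t'
}

run_pipeline() {
  ingest | process | export_rows
}

printf 'Temp, 20\nTEMP,22\nhumidity, 40\n' | run_pipeline
```

The pipe between stages plays the role of the queue in this toy version: a slow downstream filter blocks the upstream writer, which is a simple form of backpressure.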

Big Data Stack
• Message/Data Queue: hook into an existing queue to get a copy or subset of the data. The queue handles partitioning, replication, and ordering of data, and can manage backpressure from slower downstream components.
• Data Ingestion: use the direct ingestion API to capture the entire stream at wire speed. Ingestion will transform, normalize, distribute and integrate to one or more of the Analytic/Decision Engines of choice.
• Data Processing: use one or more Analytic/Decision engines to accomplish a specific task on the Fast Data window.
  – Streaming Analytics (non-query, time window model): counting, statistics, rules, triggers
  – Batch processing (search query): decision support, statistics
  – Real-Time Decisions (continuous query, OLTP model): Prog request-response, stored procedures, export for pipelining
  – Real-time decisions and/or results can be fed back "up-stream" to influence the "next step"
• Data Export: export will transform, normalize, distribute and integrate to one or more Data Warehouse or Storage Platforms.
• Data management, reporting, visualisation: Data Store & Historical Analytics (OLAP / data warehouse model, periodic query) for business intelligence, reports, and interactive use. OLAP engines support data acquisition for later historical data analysis and Business Intelligence on the entire Big Data set.

Important Big Data Technologies
By provider
• Microsoft Azure: Event Hubs, Data Factory, Stream Analytics, HDInsight, DocumentDB
• Google GCE: DataFlow, BigQuery
• Amazon: Kinesis, EMR, DynamoDB
• Open Source: Samza, Kafka, Flume, Scribe, Storm, Cascading, S4, Crunch, Spark, Hadoop, Hive, Druid, MongoDB, CouchDB, VoltDB
By stack stage
• Message/Data Queue and Data Ingestion: Event Hubs, Samza, Data Factory, Kafka, Flume, Scribe
• Streaming Analytics: DataFlow, Stream Analytics, Kinesis, Storm, Cascading, S4, Crunch, Spark Streaming
• Batch processing and Real-Time Decisions: Hadoop, HDFS, EMR, HDInsight, VoltDB
• Data Export: Sqoop, HDFS
• Data management, reporting, visualisation: HDInsight, DocumentDB, Hadoop, Hive, SparkSQL, Druid, BigQuery, DynamoDB, Vertica, MongoDB, EMR, CouchDB

Cloud Platform Benefits for Big Data
• Cloud deployment on virtual machines or containers
  – Applications portability and platform independence, on-demand provisioning
  – Dynamic resource allocation, load balancing and elasticity for tasks and processes with variable load
• Availability of rich cloud based monitoring tools for collecting performance information and optimising applications
• Network traffic segregation and isolation
  – Big Data applications benefit from the lowest possible latencies for node to node synchronization, dynamic cluster resizing, load balancing, and other scale-out operations
  – Cloud construction provides separate networks for data traffic and management traffic
  – Traffic segmentation by creating Layer 2 and Layer 3 virtual networks inside the Virtual Private Cloud (VPC) assigned to the user/application
• Cloud tools for large scale applications deployment and automation
  – Provide the basis for agile services development and zero-touch services provisioning
  – Applications deployment in the cloud is supported by major Integrated Development Environments (IDE)

Cloud HPC and Big Data Platforms
• HPC on cloud platform
  – Special HPC and GPU VM instances as well as Hadoop/HPC clusters offered by all CSPs
• Amazon Big Data services
  – Amazon Elastic MapReduce, Kinesis, DynamoDB, Redshift, etc.
• Microsoft Analytics Platform System (APS)
  – Microsoft HDInsight/Hadoop ecosystems
• IBM BlueMix applications development platform
  – Includes full cloud services and data analytics services
• LexisNexis HPC Cluster System
  – Combines an HPC cluster platform with optimized data processing languages
• Variety of Open Source tools
  – Streaming analytics/processing tools: Apache Kafka, Apache Storm, Apache Spark

LexisNexis HPCC Systems Architecture
• THOR is used for massive data processing in batch mode for ETL processing
• ROXIE is used for massive query processing and real-time analytics

LexisNexis HPCC Systems as an integrated Open Source platform for Big Data Analytics
HPCC Systems data analytics environment components: the HPCC Systems architecture model is based on a distributed, shared-nothing architecture and contains two cluster types
• THOR Data Refinery: massively parallel Extract, Transform, and Load (ETL) engine that can be used for a variety of tasks such as massive joins, merges, sorts, transformations, clustering, and scaling
• ROXIE Data Delivery: massively parallel, high throughput, structured query response engine with real-time analytics capability
Other components of the HPCC environment: data analytics languages
• Enterprise Control Language (ECL): an open source, data-centric declarative programming language
  – The declarative character of the ECL language simplifies coding
  – ECL is implicitly parallel and relies on the platform parallelism
• SALT (Scalable Automated Linking Technology): LexisNexis proprietary record linkage technology that automates the data preparation process: profiling, parsing, cleansing, normalisation, standardisation of data
  – Enables the power of the HPCC Systems and ECL
• Knowledge Engineering Language (KEL): an ongoing development
  – KEL is a domain specific data processing language that allows using semantic relations between entities to automate generation of ECL code

Cloud-powered Services Development Lifecycle: DevOps
Continuous service improvement
[Figure: environments under Configuration Management & Orchestration: Dev (all in one instance), Test/QA (2-tier app with test DB), Staging (2-tier app with production data), Prod (2-tier app with production data, Multi-AZ and HA); tooling: Amazon CloudFormation with Chef (or Puppet).]
• Easily creates a test environment close to the real one
• Powered by cloud deployment automation tools
  – Continuous development, test, integration
  – Enables configuration management and orchestration, deployment automation
• CloudFormation Template, Configuration Template, Bootstrap Template
• Can be used with Puppet and Chef, two configuration and deployment management systems for clouds
[ref] "Building Powerful Web Applications in the AWS Cloud" by Louis loud/

CYCLONE Project: Automation platform for cloud based applications
• Biomedical and Energy applications
• Multi-cloud multi-provider
• Distributed and data processing environment
• Network infrastructure provisioning
  – Dedicated and virtual overlay over Internet
• Automated applications provisioning

CYCLONE Components

SlipStream – Cloud Automation and Management Platform
Provides a complete engineering PaaS supporting DevOps processes
• Deployment engine
• "App Store" for sharing application definitions with other users
• "Service Catalog" for finding appropriate cloud service offers
• Proprietary recipe format
  – Stored and shared via the App Store
• All features are available through the web interface or RESTful API
• Similar to Chef, Puppet, Ansible
• Supports multiple cloud platforms, including AWS, Microsoft Azure, StratusLab

Bioinformatics Use Cases
1. Securing human biomedical data
2. Cloud virtual pipeline for microbial genomes analysis
3. Live remote cloud processing of sequencing data
On-demand bandwidth, compute as well as complex orchestration.

Functionality used for use cases deployment
The definition of an application component consists of a series of recipes that are executed at various stages in the lifecycle of the application.
• Pre-install: used principally to configure and initialize the operating system's package management.
• Install packages: a list of packages to be installed on the machine. SlipStream supports the package managers for the RedHat and Debian families of OS.
• Post-install: can be used for any software installation that cannot be handled through the package manager.
• Deployment: used for service configuration and initialization. This script can take advantage of SlipStream's "parameter database" to pass information between components and to synchronize the configuration of the components.
• Reporting: collects files (typically log files) at the end of the deployment and makes them available through SlipStream.
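The staged execution described above can be pictured with a small shell sketch. It is an assumption-laden illustration, not SlipStream code: the function names only mirror the stage names on the slide, the stage bodies are placeholders, and stopping at the first failed stage is a design choice of the sketch.

```shell
#!/usr/bin/env bash
# Hypothetical stage hooks: the names mirror the slide, they are not SlipStream identifiers.
STAGES_RUN=""
mark() { STAGES_RUN="${STAGES_RUN:+$STAGES_RUN }$1"; }

pre_install()      { mark pre-install; }       # e.g. configure package repositories
install_packages() { mark install-packages; }  # e.g. install the listed packages
post_install()     { mark post-install; }      # e.g. software not available as a package
deployment()       { mark deployment; }        # e.g. configure and start services
reporting()        { mark reporting; }         # e.g. collect log files

# Run the stages in lifecycle order and stop at the first failure.
run_lifecycle() {
  for stage in pre_install install_packages post_install deployment reporting; do
    "$stage" || { echo "stage '$stage' failed" >&2; return 1; }
  done
}

run_lifecycle && echo "completed: $STAGES_RUN"
```

Keeping each stage in its own function makes it possible to rerun or replace a single stage, which is roughly what separate recipes per lifecycle stage provide.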

The master node deployment
The master node deployment script performs the following actions:
• Initialize the yum package manager.
• Install bind utilities.
• Allow SSH access to the master from the slaves.
• Collect IP addresses for batch system.
• Configure batch system admin user.
• Export NFS file systems to slaves.
• Configure batch system.
• Indicate that cluster is ready for use.

Example script to export NFS directory

    ss-display "Exporting SGE_ROOT_DIR."
    # Start the exports entry with the shared directory, followed by a tab.
    echo -ne "$SGE_ROOT_DIR\t" >> "$EXPORTS_FILE"
    # Loop over all slave nodes; multiplicity is the number of slaves.
    for (( i=1; i<=$(ss-get Bacterial_Genomics_Slave:multiplicity); i++ )); do
        node_host=$(ss-get Bacterial_Genomics_Slave.$i:hostname)
        echo -ne "$node_host" >> "$EXPORTS_FILE"
        echo -ne "(rw,sync,no_root_squash) " >> "$EXPORTS_FILE"
    done
    echo >> "$EXPORTS_FILE"   # terminate the entry with a newline
    exportfs -av

• The ss-get command retrieves a value from the parameter database.
• The script determines the number of slaves and then loops over each one
  – It retrieves each IP address (hostname) and adds it to the NFS exports file.
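To see the exports entry that such a loop produces without a running SlipStream deployment, the following sketch replaces the parameter database with a stub. Everything in it is hypothetical: `get_param` stands in for ss-get, the `slave` node name and the addresses are made up, and a space is used after the directory where the original script writes a tab.

```shell
#!/usr/bin/env bash
# Stand-in for the SlipStream parameter database (hypothetical keys and values).
get_param() {
  case "$1" in
    slave:multiplicity) echo 3 ;;
    slave.1:hostname)   echo 10.0.0.11 ;;
    slave.2:hostname)   echo 10.0.0.12 ;;
    slave.3:hostname)   echo 10.0.0.13 ;;
    *) return 1 ;;
  esac
}

# Build one /etc/exports-style entry: the directory, then one host(options) item per slave.
build_exports_line() {
  dir="$1"
  line="$dir"
  count="$(get_param slave:multiplicity)"
  i=1
  while [ "$i" -le "$count" ]; do
    host="$(get_param "slave.$i:hostname")"
    line="$line ${host}(rw,sync,no_root_squash)"
    i=$((i + 1))
  done
  printf '%s\n' "$line"
}

build_exports_line /opt/sge
```

Building the whole entry in a variable and printing it once avoids a half-written line in the exports file if a parameter lookup fails midway.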

Questions and Discussion
• More information about CYCLONE project: ex.html
