Transcription

Overview of Tools in the Hadoop Ecosystem

Lecture BigData Analytics

Julian M. Kunkel ([email protected])
University of Hamburg / German Climate Computing Center (DKRZ)

2017-12-22

Disclaimer: Big Data software is constantly updated; code samples may be outdated.

Outline

1 Hadoop Ecosystem
2 User/Admin Interfaces
3 Workflows
4 SQL Tools
5 Other BigData Tools
6 Machine Learning
7 Summary

Hortonworks

(Screenshot from [40])

- Built with open-source tools
- Additionally: Hortonworks offers support and services

Cloudera Enterprise Hadoop Ecosystem [25]

- Cloudera offers support, services and tools around Hadoop
- Unified architecture: common infrastructure and data pool for the tools
- Built with open-source tools, plus some in-house tools for management and encryption

Source: [26]

Supporting Tools [1]

- Ambari: a tool for managing Hadoop clusters
- Hue: manage "BigData" projects in a browser
- ZooKeeper: coordination/configuration service for services
- Sqoop: ETL between HDFS and structured data stores
- Oozie: workflow scheduler (schedules/triggers workflows)
- Falcon: data governance engine for data pipelines
- Flume: collecting, aggregating and moving large amounts of streaming event data
- Kafka: publish-subscribe distributed messaging system
- Knox: REST API gateway (for all services)
- Ranger: integrates ACL permissions into Hadoop (ecosystem)
- Slider: YARN application supporting monitoring and dynamic scaling of non-YARN apps

[1] https://hadoop.apache.org/

Ambari: A Tool for Managing Hadoop Clusters

- Convenient tool managing 10 Apache tools
- Supports installation and management (a REST sketch follows below)
  - Dealing with data dependencies
  - Service startup
- Monitoring of health and performance
- (Re)configuration of services
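Ambari also exposes its functionality through a REST API; the following is a minimal sketch, where the host, credentials and cluster name (ambari-host, admin:admin, my-cluster) are placeholders:

# Query the status of the HDFS service
curl -u admin:admin http://ambari-host:8080/api/v1/clusters/my-cluster/services/HDFS
# Stop the service by setting its desired state to INSTALLED
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Stop HDFS"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
  http://ambari-host:8080/api/v1/clusters/my-cluster/services/HDFS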

Management with Ambari: Dashboard

(Screenshot from the WR-cluster Ambari)

Management with Ambari: Configuration

Knox: Security for Hadoop [22]

- REST API gateway for Hadoop ecosystem services
  - Supports: HDFS, HCatalog, HBase, Oozie, Hive, YARN, Storm
  - Supports multiple clusters
- Provides authentication, federation/SSO, authorization, auditing
- Enhances security by providing central control and protection
  - SSL encryption
  - Authentication: LDAP, Active Directory, Kerberos
  - Authorization: ACLs (user, group, IP) on the service level

Source: [22]

Example Accesses via the REST API [22]

List an HDFS directory:

# The gateway URL follows the Knox sandbox example; adjust it to the deployment
curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'

Example response:

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 450

{"FileStatuses":{"FileStatus":[
  {"accessTime":0,"blockSize":0,"group":"hdfs","length":0,
   "modificationTime":1350595859762,"owner":"hdfs","pathSuffix":"apps",
   "permission":"755","replication":0,"type":"DIRECTORY"},
  {"accessTime":0,"blockSize":0,"group":"mapred","length":0,
   "modificationTime":1350595874024,"owner":"mapred","pathSuffix":"mapred",
   "permission":"755","replication":0,"type":"DIRECTORY"}
]}}

Hue [12]: Lightweight Web Server for Hadoop

- Manage BigData projects in a browser
- Supports the Hadoop ecosystem: HDFS, Pig, Sqoop, Hive, Impala, MapReduce, Spark, ...
- Features
  - Data upload/download
  - Management of HCatalog tables
  - Query editor (Hive, Pig, Impala)
  - Starting and monitoring of jobs

Hue: Lightweight Web Server for Hadoop

Monitoring Oozie workflows (live system on gethue.com)

Hue: Lightweight Web Server for Hadoop

File browser (live system on gethue.com)

Hue: Lightweight Web Server for Hadoop

Query editor (live system on gethue.com)

Hue: Lightweight Web Server for Hadoop

Visualizing query results in diagrams (live system on gethue.com)

Zeppelin [39]

- Web-based notebook for interactive data analytics
  - Add code snippets
  - Arrange them
  - Execute them
  - Visualize the results
- Supports Spark, Scala, Pig, SQL, Python, R, Hive, Shell, ...
- Collaborative environment for multiple users
- Can export paragraph links for embedding into a webpage

Zeppelin

(Screenshot)

Outline

3 Workflows

Oozie [15, 16]

- Scalable, reliable and extensible workflow scheduler
- Jobs are DAGs of actions specified in XML workflows
  - Actions: MapReduce, Pig, Hive, Sqoop, Spark, shell actions
  - Workflows can be parameterized
- Triggers notifications via HTTP GET upon the start/end of a node/job
- Automatic user-retry to repeat actions when fixable errors occur
- Monitors a few runtime metrics upon execution
- Interfaces: command line tools (see the sketch below), a web service and Java APIs
- Integrates with HCatalog
- Coordinator jobs trigger the start of jobs
  - By time schedules
  - When data becomes available
    - Requires polling of HDFS (1-10 min intervals)
    - With HCatalog's publish-subscribe, jobs can be started immediately
- Can record events for service-level agreements
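A minimal sketch of driving the workflow lifecycle with the oozie command line tool; the server URL and the job ID shown are placeholders:

# Submit and start a workflow configured by a local job.properties
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
# Check the status of the job via the ID printed by -run
oozie job -oozie http://localhost:11000/oozie -info 0000001-171222010101010-oozie-oozi-W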

Workflows [16]

- A workflow application is a ZIP file to be uploaded (see the layout sketch below)
  - Includes the workflow definition and the coordinator job
  - Bundles scripts, JARs and libraries needed for execution
- The workflow definition is a DAG with control-flow and action nodes
  - Control flow: start, end, decision, fork, join
  - Action nodes: whatever to execute
- Variables/parameters [3]
  - Default values can be defined in a config-default.xml in the ZIP
  - Expression language functions help in parameterization
    - Basic functions: timestamp(), trim(), concat(s1, s2)
    - Workflow functions: wf:errorCode(<action node>)
    - Action-specific functions, e.g., for Hadoop counters: ${hadoop:counters("mr-node")["counters"]["FILE_BYTES_READ"]}
- The coordinator job is also an XML file

[3] They are used with ${NAME/FUNCTION}, e.g., ${timestamp()}
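A sketch of a typical application layout and its upload to HDFS; all paths and file names are chosen for illustration:

# myapp/
#   workflow.xml          the workflow definition (DAG)
#   coordinator.xml       an optional coordinator job
#   config-default.xml    default values for parameters
#   lib/                  JARs and scripts bundled for the actions
hdfs dfs -put myapp /tmp/workflows/myapp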

Coordinator Jobs [17]

An app which periodically starts a workflow (every 60 min):

<coordinator-app name="MY_APP" frequency="60" start="2009-01-01T05:00Z"
    end="2009-01-01T06:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow> <!-- here the workflow is not further defined -->
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>

Every 24 h, check whether the dependencies for a workflow are met, then run it:

<coordinator-app name="MY_APP" frequency="1440" start="2009-02-01T00:00Z" end="2009-02-07T00:00Z" ...>
  <datasets> <!-- define a dataset that is checked for existence -->
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events> <!-- we depend on the last 24 hours of input data -->
    <data-in name="coordInput1" dataset="input1">
      <start-instance>${coord:current(-23)}</start-instance>
      <end-instance>${coord:current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>

Example Oozie Workflow [13]

Three actions: execute a Pig script, concatenate the reducer files, upload the files remotely via SSH.

<workflow-app xmlns="uri:oozie:workflow:0.2" name="sample-wf">
  <start to="pig"/>

  <action name="pig">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare><delete path="${output}"/></prepare>
      <configuration>
        <property><name>mapred.job.queue.name</name><value>${queueName}</value></property>
        <property><name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name><value>true</value></property>
      </configuration>
      <script>${nameNode}/projects/bootcamp/workflow/script.pig</script>
      <param>input=${input}</param>
      <param>output=${output}</param>
      <file>lib/dependent.jar</file>
    </pig>
    <ok to="concatenator"/> <error to="fail"/> <!-- the concatenator action is not shown here -->
  </action>

  <action name="fileupload">
    <ssh>
      <host>localhost</host>
      <command>/tmp/fileupload.sh</command>
      <args>${nameNode}/projects/bootcamp/concat/data-${fileTimestamp}.csv</args>
      <args>${wf:conf("ssh.host")}</args>
      <capture-output/>
    </ssh>
    <ok to="fileUploadDecision"/> <error to="fail"/>
  </action>

  <decision name="fileUploadDecision"> <!-- check the exit status of the file upload -->
    <switch>
      <case to="end">${wf:actionData('fileupload')['output'] == '0'}</case>
      <default to="fail"/>
    </switch>
  </decision>

  <kill name="fail">
    <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

Falcon [11, 13]

- Feed (data set) management and processing system
  - Simplifies dealing with many Oozie jobs
- Supports data governance
  - Define and run data pipelines (management policies)
  - Monitor data pipelines
  - Trace pipelines to identify dependencies and perform audits
- A data model defines the entities describing policies and pipelines (registered via the CLI, see the sketch below)
  - Clusters define the resources and interfaces to use
  - Feeds define frequency, data retention, inputs, outputs, retry, and use clusters (multiple for replication)
  - Process: a processing task, i.e., an Oozie workflow, Hive or Pig script
- Features
  - Supports reuse of entities for different workflows
  - Enables replication across clusters and data archival
  - Supports HCatalog
  - Notification of users upon availability of feed groups
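Entities are registered and scheduled with the falcon command line tool; a minimal sketch, with the XML file names chosen for illustration:

falcon entity -type cluster -submit -file cluster.xml
falcon entity -type feed -submit -file feed.xml
falcon entity -type process -submitAndSchedule -file process.xml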

Falcon: High-Level Architecture

Source: [11]

Falcon: Example Pipeline

Source: [11]

Falcon: Example Process Definition [11]

<?xml version="1.0" encoding="UTF-8"?>
<!-- Sample process. Runs at the 6th hour every day. Input: the last day's hourly data. Output: for yesterday -->
<process name="SampleProcess">
  <cluster name="wr"/>
  <frequency>days(1)</frequency>
  <validity start="2015-04-03T06:00Z" end="2022-12-30T00:00Z" timezone="UTC"/>
  <inputs>
    <input name="input" feed="SampleInput" start="yesterday(0,0)" end="today(-1,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="SampleOutput" instance="yesterday(0,0)"/>
  </outputs>
  <properties>
    <property name="queueName" value="reports"/>
    <property name="ssh.host" value="host.com"/>
    <property name="fileTimestamp" value="${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}"/>
  </properties>
  <workflow engine="oozie" path="/projects/bootcamp/workflow"/>
  <retry policy="backoff" delay="minutes(5)" attempts="3"/>
  <!-- How to check and handle the late arrival of input data -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/projects/bootcamp/workflow/lateinput"/>
  </late-process>
</process>

Atlas [23]

- A framework for platform-agnostic data governance
- Exchange metadata with other tools
- Audit operations; explore the history of data and metadata
- Supports lifecycle-management workflows built with Falcon
- Supports Ranger access control (ACLs)

Source: [23]

Sqoop [18, 19]

- Transfers bulk data between Hadoop and an RDBMS, either
  - one/multiple tables (preserving their schema), or
  - the results of a free-form SQL query
- Uses MapReduce to execute import/export jobs
  - Parallelism is based on splitting one column's values
- Validates data transfers (comparing row counts) for full tables
- Saves jobs for repeated invocation
- Main command line tool sqoop; more specific tools sqoop*

Features [19]

Import features
- Incremental import (scan and add only newer rows; see the sketch below)
- File formats: CSV, SequenceFiles, Avro, Parquet
- Compression support
- Outsources large BLOBs/TEXT into additional files
- Import into Hive (and HBase)
  - Can create the table schema in HCatalog automatically
  - With HCatalog, only CSV can be imported

Export features
- Bulk insert: 100 records per statement
- Periodic commit after 100 statements
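For instance, an incremental import appends only rows whose key exceeds the last imported value; a sketch in which the connection string, check column and last value are placeholders:

sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST \
    --incremental append --check-column id --last-value 1000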

Import Process [19]

1. Read the schema of the source table
2. Create a Java class representing a row of the table
   - This class can be used later to work with the data
3. Start MapReduce to load the data in parallel into multiple files
   - The number of mappers can be configured
   - The mappers work on different values of the splitting column
     - The default splitting column is the primary key
     - Determine the min and max value of the key
     - Distribute fixed chunks to the mappers
4. Output status information to the MapReduce job tracker

Example Imports [19]

# Import columns from table "TEST" into HDFS below /home/x (the table name is appended)
# When not specifying any columns, all columns will be imported.
sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST \
    --columns "matrikel,name" --warehouse-dir /home/x --validate

# We'll use a free-form query; it is parallelized on the split-by column.
# The value is set into the magic $CONDITIONS variable.
sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
    --split-by a.id --target-dir /user/foo/joinresults

# To create the HCatalog table, use --hcatalog-table or --hive-import
# See [19] for details of the available options
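The export direction works analogously; a minimal sketch that writes an HDFS directory back into a database table (the table and directory names are placeholders):

sqoop export --connect jdbc:mysql://localhost/db --username foo \
    --table RESULTS --export-dir /user/foo/joinresults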

Slider [20]

- A YARN application that manages non-YARN apps on a cluster
  - Utilizes YARN for resource management
  - Enables installation, execution, monitoring and dynamic scaling
- Command line tool slider (see the sketch below)
- Apps are installed and run from a package
  - Tarball with a well-defined structure [21]
  - Scripts for installing, starting, status, ...
  - Example packages: jmemcached, HBase
- Slider is currently being extended to deploy Docker images (tech preview)
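A sketch of deploying and scaling a packaged app with the slider tool; the application name, component name and JSON files are placeholders:

slider create memcached1 --template appConfig.json --resources resources.json
slider status memcached1
# Dynamically scale one component type to three instances
slider flex memcached1 --component MEMCACHED 3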

Outline

4 SQL Tools

Drill [10, 29, 30]

- Software framework for data-intensive distributed applications
- Data model: relational (ANSI SQL!) + schema-free JSON
- Analyzes data in situ without data movement
- Executes one query against multiple NoSQL datastores
  - Datastores: HBase, MongoDB, HDFS, S3, Swift, local files
- Features
  - REST APIs
  - Columnar execution engine supporting complex data
  - Locality-aware execution
  - Cost-based optimizer pushing processing into the datastore
  - Runtime compilation of queries

# Different datastores: local storage, MongoDB and S3
SELECT * FROM dfs.root.`/logs`;
SELECT country, count(*) FROM mongodb.web.users GROUP BY country;
SELECT timestamp FROM s3.root.`users.json` WHERE user_id = 'max';

# Query JSON: access the first student's age from the private data (a map)
SELECT student[0].private.AGE FROM dfs.`students.json`;

Cloudera Impala [25, 26]

- Enterprise analytic database
- Utilizes HDFS, HBase and Amazon S3
- Based on Google Dremel, like Apache Drill
- Written in C++ and Java
- Massively parallel SQL engine
- Supports HiveQL and a subset of ANSI-92 SQL
- Uses LLVM to generate efficient code for queries
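Queries can be issued interactively or scripted via the impala-shell tool; a minimal sketch in which the daemon host and the table are placeholders:

impala-shell -i impalad-host:21000 -q 'SELECT count(*) FROM web_logs'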

Apache MetaModel [43]

- Provides a Java-based SQL-like interface to various data sources
  - CSV, SQL databases, JSON, HBase, MongoDB

Query [43] (identifiers lost in the transcription are filled in for illustration):

DataContext dataContext = DataContextFactory.create[TypeOfDatastore](...);
DataSet dataSet = dataContext.query().from("tools").select("name")
    .where("language").eq("Java").and("enhances data access").eq(true).execute();

Update [43]:

dataContext.executeUpdate(new UpdateScript() {
  public void run(UpdateCallback callback) {
    // CREATE a table
    Table table = callback.createTable(dataContext.getDefaultSchema(), "people")
        .withColumn("id").ofType(ColumnType.INTEGER)
        .withColumn("name").ofType(ColumnType.VARCHAR).execute();

    // INSERT INTO table
    callback.insertInto(table).value("id", 1).value("name", "John D.").execute();
    callback.insertInto(table).value("id", 2).value("name", "Jane D.").execute();

    // UPDATE table
    callback.update(table).value("name", "Jane Doe").where("id").eq(2).execute();

    // DELETE FROM table
    callback.deleteFrom(table).where("id").eq(1).execute();
  }
});

Outline

5 Other BigData Tools

Kafka [41]

- Publish-subscribe distributed messaging system
  - A producer publishes messages for a given topic
  - A consumer subscribes to a topic and receives data
  - Simple: the consumer has to remember its read position (offset)
- A data source for Storm, HBase, Spark, ...
- Use cases, supporting data ingestion:
  - GPS data from a truck fleet, sensor data
  - Error logs from cluster nodes, web server activity
- Features
  - Parallel, fault-tolerant server system (a server is called a broker)

Source: [42]
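A quick sketch of publishing and consuming messages with Kafka's console tools; the broker address and topic name are placeholders:

kafka-console-producer.sh --broker-list localhost:9092 --topic sensor-data
# In another shell: read the topic from the beginning
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sensor-data --from-beginning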

Solr [10, 31]

- Full-text search and indexing platform
- REST API: index documents and query them via HTTP
  - Query responses in JSON, XML, CSV, binary
- Features
  - Data can be stored in HDFS
  - High availability, scalable and fault tolerant
  - Distributed search
  - Faceted classification: organize knowledge into a systematic order using (general or subject-specific) semantic categories that can be combined for a full classification entry [10]
  - Geo-spatial search
  - Caching of queries, filters and documents
- Uses the Lucene library for searching
- Very similar: Elasticsearch [33], see http://solr-vs-elasticsearch.com/

Example Query [32]

Identifying the available facet terms and the number of docs for each:

curl 'http://localhost:8983/solr/techproducts/select?wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=manu_id_s'

Response:

{
  "responseHeader":{
    "params":{ /* parameters of the query */
      "facet":"true", "indent":"true", "q":"*:*",
      "facet.field":"manu_id_s", "rows":"0", "wt":"json"}},
  "response":{"numFound":2990,"start":0,"docs":[]}, /* number of documents found */
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{ /* the available facets and the number of documents */
      "manu_id_s":[
        "corsair",3, "belkin",2, "canon",2, "apple",1, "asus",1, "ati",1,
        "boa",1, "dell",1, "eu",1, "maxtor",1, "nor",1, "uk",1,
        "viewsonic",1, "samsung",0]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{}}
}

Outline

6 Machine Learning

Mahout [34]

- Framework for scalable machine learning
  - Collaborative filtering
  - Classification
  - Clustering
  - Dimensionality reduction
- Recommender: history (user purchases, all purchases) → recommendations (user)
- Computation on the Spark, MapReduce and H2O engines [36]
  - Can also use a single machine without Hadoop
  - Algorithm availability depends on the backend
- Bindings for the Scala language [35]
  - Provide distributed BLAS and the Distributed Row Matrix (DRM)
  - R-like DSL embedded in Scala
  - Algebraic optimizer

Recommender Architecture

1. Collect user interactions: n x (user-id, item-id)
2. Learning:
   1. Itemsimilarity creates (item, list-of-similar-items) tuples (see the sketch after this list)
   2. Store those tuples in the search engine
3. Query the search engine with the n latest user interactions
4. If they occur in the list-of-similar-items, recommend the item

Source: [36]
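Step 2.1 can be computed with Mahout's spark-itemsimilarity job; a sketch, assuming an interaction log in CSV form (the file names and master URL are placeholders):

mahout spark-itemsimilarity --input interactions.csv --output /tmp/indicators --master local[4]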

TensorFlow [44]

- Platform-independent open-source library for machine learning
  - GPU accelerated
  - Developed by Google Brain
- Tensor: a multidimensional array with rank (= number of) dimensions
- Workflow
  - Define tensors (or placeholders)
  - Build the execution graph based on equations and optimizers (math)
  - Start the execution (bind dummy variables)
- Big community
- PyTorch: an alternative from Facebook with dynamic graph creation

TensorFlow: Example for the MNIST Dataset [47]

# Here we build a neural network of a linear classifier to estimate classes
import tensorflow as tf

# Create two tensors holding the features and labels for MNIST
# 28x28 images; their number is unknown (None)
x = tf.placeholder(tf.float32, [None, 784], name='Feature')  # the name is useful in the GUI
# Create a tensor to recognize the 10 classes 0-9
y = tf.placeholder(tf.float32, [None, 10], name='Label')

# Values to compute for a linear classifier: for each pixel, the chance it indicates 0-9
W = tf.Variable(tf.zeros([784, 10]), name='Weights')
b = tf.Variable(tf.zeros([10]), name='Bias')

# Formulate the solution with tensors
# Softmax returns a probability for each class that sums up to 1; here 10 classes
pred = tf.nn.softmax(tf.matmul(x, W) + b)
# Cost function for gradient descent; here: the cross-entropy
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), reduction_indices=1))
# Learning algorithm to train the network
learning_rate = 0.01  # example value
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Start the execution; assume features and labels have been loaded
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run([optimizer, cost], feed_dict={x: features, y: labels})

TensorBoard [45]

- Browser visualization for TensorFlow
- Model graph
  - Operations can be given names
  - Name scopes organize operations together
- Results: scalars, events, histograms, multi-dimensional data, audio, images, ...
  - e.g., visualize accuracy over training steps
- Embeddings: a mapping from discrete objects, e.g., words, to vectors of real numbers
- Reads (binary) output from a log directory; there can be multiple runs!
  - Uses protocol buffers for serialization
- Analyze hyperparameter searches
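A minimal sketch of feeding TensorBoard: write the graph from the training code, then point the server at the log directory (the path is a placeholder):

#   writer = tf.summary.FileWriter('/tmp/tf-logs', sess.graph)   # in the Python code
tensorboard --logdir=/tmp/tf-logs   # then browse to http://localhost:6006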

TensorBoard: Model Graph Visualization

TensorBoard: Result Visualization

Summary

- The (Apache) Hadoop community is active
- Software responsibilities:
  - Hadoop deployment and cluster management
  - Data management and provenance
  - Security
  - Analysis
  - Automation (scheduling, data ingestion)
- One goal: simple usage
  - Alternative user interfaces
  - Research on domain-specific languages (XML-based or language-embedded)
- Many software packages are used but still in the Apache Incubator (beta)

Bibliography

...apache-falcon-...
...oozie/wiki/Oozie-Coord-Use-...
...org/docs/slider_specs/application_...
hortonworks.com/blog/apache-atlas-project-proposed-for-hadoop-...
https://www.cloudera.com/content/cloudera/en/products-and-...
...content/cloudera/en/products-and-services/cloudera-...
...apache.org/docs/json-data-...
http://solr-vs-elasticsearch.com/
www.scala-...
...users/recommender/quickstart.html
Free ebook: https://www.mapr.com/practical-machine-...
hortonworks.com/products/data-...
...get_started/get_started
