
QryGraph: A Graphical Tool for Big Data Analytics

Sanny Schmid, Ilias Gerostathopoulos, Christian Prehofer
Fakultät für Informatik
Technische Universität München
Munich, Germany
{schmidsa, gerostat, prehofer}@in.tum.de

Abstract: The advent of Big Data has created a rich set of diverse languages and tools for data manipulation and analytics within the Hadoop ecosystem. Pig has a prominent role within this ecosystem as a scripting layer: a convenient way to create analytics jobs that are issued for batch processing in a Hadoop cluster. In order to leverage the benefits of graphical domain-specific languages, namely intuitive visual design and inspection, we implemented a web-based graphical tool called QryGraph that complements Pig in various ways. First, it allows a user to create Pig queries in a graphical editor and check their syntax. Second, it provides an administrative interface for managing the execution and overall lifecycle of Pig queries. Finally, it will allow for debugging by running queries on test data sets and for creating user-defined query sub-graphs that can be reused across different Pig queries.

Keywords: Big Data; tool support; Pig language

I. INTRODUCTION

Recent advances in Big Data and Internet of Things technologies have created new disruptive possibilities related to the insight that can be obtained by analyzing large quantities of sensed data [1]–[3]. More and more enterprises are looking into ways to build value-added services on top of Big Data analytics infrastructures.

In this context, Apache Hadoop has emerged as the de facto standard in Big Data analytics. Hadoop is an open-source ecosystem comprising a multitude of languages and tools for data manipulation and analytics, e.g. MapReduce, HDFS, Hive, Pig, HBase, Kafka, Storm, and Spark. It supports processing of both static data (batch mode) and streams of incoming data in a real-time fashion (stream mode).
At the same time, it comprises tools that address the needs of developers, administrators, data scientists, and analysts. The main advantage of Hadoop is that it provides a fault-tolerant infrastructure that can easily scale to several thousand nodes in a single cluster and to several petabytes of data. The data stored in a Hadoop cluster are analyzed in batch mode by developing application-specific "mappers" and "reducers", i.e. functions that work on key/value pairs: mappers operate on input data (e.g. a large file residing in the Hadoop Distributed File System, HDFS [4]); reducers combine and summarize the results of the mappers [5]. Once the application-specific mapper and reducer are implemented in Java or another Hadoop-compliant implementation language, they are bundled together into a single analytics job (also called a map-reduce program) that is issued to the Hadoop cluster. This triggers the parallel execution of several mapper and reducer tasks; the end result is then written back to the HDFS.

978-1-5090-1897-0/16/$31.00 ©2016 IEEE

Within the Hadoop ecosystem, the Pig platform and language (also called Pig Latin) provide a convenient way to create analytics jobs compared to the manual implementation of mappers and reducers in a general-purpose programming language such as Java [6], [7]. Being a domain-specific language, Pig is essentially a thin layer over Hadoop that allows for specifying succinct scripts for analytics jobs that load data, apply transformations on the data, and store the final results. Pig has been significantly enhanced since its inception at Yahoo! Research in 2006 to include several advanced features such as error handling and type checking, together with several performance improvements [7].
We nevertheless believe that there is still room for improvement, in particular in supporting the users of Pig (typically developers) in the creation, validation, and evolution of complex analytics jobs.

To this end, we implemented a tool for Big Data analytics called QryGraph. Our tool supports Pig users in creating new analytics jobs that comply with the Pig language via an intuitive graphical editor. The editor allows for both visual design and validation via visual inspection. The tool also provides an administrative interface for managing the execution and overall lifecycle of analytics jobs, including testing and debugging features. Finally, it will provide enhanced debugging features, such as running queries on test data sets, and will allow creating user-defined query sub-graphs that can be reused across different Pig queries.

In this artifact paper, we present the main functionalities and design goals of QryGraph, together with its technical architecture and extensibility mechanisms. We detail the ongoing implementation and articulate our long-term plans. Our initial experience with using the tool indicates that it could become a vehicle for enhancing the understanding, creation, and evolution of Big Data analytics.

The rest of the text is structured as follows. Section II presents the running example and its solution in Pig. Section III presents our tool and discusses its main features and usage scenarios. Section IV reflects on our experience so far with QryGraph and on its current limitations. Finally, Section V compares our approach in supporting Big Data analytics to the state of the art and practice, and Section VI concludes and presents our future work plan.
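To make the mapper/reducer division of labor from the introduction concrete, the following is a minimal, single-process sketch in Python. It is not Hadoop code: the record layout (`car_id,speed`) and the names `mapper`, `reducer`, and `run_job` are assumptions chosen for illustration, and the "shuffle" step is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit one (car_id, 1) pair per input record (hypothetical 'car_id,speed' CSV)."""
    car_id, _speed = line.split(",")
    yield car_id, 1


def reducer(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Summarize all values emitted for one key, here by summing them."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Simulate a map-reduce job: map every record, group by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # stands in for Hadoop's shuffle phase
    return dict(reducer(key, values) for key, values in groups.items())


records = ["car1,10", "car2,3", "car1,7"]
print(run_job(records))  # number of records seen per car
```

In a real cluster the mapper and reducer calls would run in parallel on different nodes; the sketch only shows the contract each function has to satisfy.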

II. RUNNING EXAMPLE AND BACKGROUND

A. Running Example

In our running example, the administration of the city of Munich aims to implement a new pricing system for parking lots (PLs). The price of each PL should be calculated based on the number of cars driving in its vicinity. The prices should be updated periodically to ensure a fair pricing allocation.

To be able to implement the above mechanism, the city has access to data collected from user cars and from PLs belonging to city-run parking stations. For each car, its position and speed are periodically monitored; for each PL, its position and availability. Data is stored locally for the course of a full day and submitted for analysis as a full-day batch. For illustration, the data sets could look like the ones depicted in Fig. 1.

Fig. 1. Exemplary data sets in the running example.

To get the necessary information on all cars driving near an available PL, the cars data set is filtered to only consider cars with a speed higher than, e.g., 5 km per hour (which indicates that they are not parked). This data set is combined with the PL data set in order to find cars driving near a PL. This creates a list of PLs and the number of cars driving near them. This information is then used to determine the price of a PL.

COMMAND    DESCRIPTION                                  EXAMPLE
LOAD       Used to load data from the HDFS              A = LOAD 'sample.csv' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
FOREACH    Run a command for each data row              B = FOREACH A GENERATE name;
GROUP      Groups data into tuples                      C = GROUP B BY name;
JOIN       Performs a join on two data sets             N = JOIN A BY name, K BY name;
DISTINCT   Removes duplicate tuples in a relation       C = DISTINCT B;
CROSS      Creates the cross product of two data sets   F = CROSS C, E;
FILTER     Filters a data set based on an expression    H = FILTER G BY <Boolean expr>;

Fig. 2. Popular Pig commands.

B. Background: Pig

Pig Latin [6] is a procedural query language for very large data sets.
It offers an alternative to the well-known SQL standard for querying HDFS and is designed to work with the Hadoop infrastructure. Instead of one single relational query, a Pig query consists of a directed acyclic graph of nodes that can be seen as an execution plan. Within this graph, each node describes one step that is needed to execute the query, from loading the data to the final output. Pig optimizes a query on the fly before sending it to the Hadoop cluster [7] and reaches a performance that is comparable with native Hadoop implementations. It is widely used as an alternative query language and is also included in popular Hadoop distributions like Hortonworks [8].

The Pig language consists of a broad set of commands. The result of each command is assigned to a variable that can be used by other commands in the query. A number of popular Pig commands are depicted in Fig. 2. An important difference to SQL is that Pig allows special data formats like TUPLE, BAG or MAP. A GROUP command, e.g., puts all elements into a BAG that is associated with the group key. These additional formats can also be used as nested data structures, e.g. a bag within a bag.

In addition to the native Pig commands, it is possible to use user-defined functions (UDFs) that implement custom behavior that might require a complex computation. These can be written in Java or Python and included in the Pig query as seen in Fig. 3 (line 1).

For illustration, the running example could be modeled by the Pig query depicted in Fig. 3; the corresponding abstract query graph is depicted in Fig. 4.

    DEFINE Distance datafu.pig.geo.HaversineDistInMiles();

    -- Create a list of PLs and their position
    A = LOAD 'slots.csv' USING PigStorage(',') AS
        (PL_ID:int, AVAILABLE:int, LONGITUDE:double, LATITUDE:double, TIMESTAMP:long);
    B = FOREACH A GENERATE PL_ID, LONGITUDE, LATITUDE;
    C = DISTINCT B;

    -- Create a list of all cars that were driving
    D = LOAD 'cars.csv' USING PigStorage(',') AS
        (CAR_ID:int, SPEED:int, LONGITUDE:double, LATITUDE:double, TIMESTAMP:long);
    E = FILTER D BY SPEED > 5.0;

    -- Join the data by GPS distance
    F = CROSS C, E;
    G = FOREACH F GENERATE *, Distance(C::LATITUDE, C::LONGITUDE, E::LATITUDE, E::LONGITUDE) AS DISTANCE;
    H = FILTER G BY DISTANCE < 5.0;

    -- Count the number of distinct cars for each PL
    I = GROUP H BY PL_ID;
    J = FOREACH I {
        distCars = DISTINCT H.CAR_ID;
        GENERATE $0, COUNT(distCars);
    };

Fig. 3. Possible Pig query for the running example.

Fig. 4. Abstract graph of the Pig query of the running example.

III. TOOL DESCRIPTION

QryGraph is a tool to simplify the creation, maintenance, evolution, and management of Big Data analytics jobs. An analytics job in QryGraph corresponds to a Pig language query. A Pig query is specified in a graphical editor (the heart of QryGraph), which represents a query by its abstract graph.

The user specifies Pig queries by modeling the data flow between nodes corresponding to Pig language commands (see Fig. 2). Each command corresponds to a node in the graph; the nodes' inputs and outputs are connected to create the query graph. The user receives immediate feedback once a type error is introduced in the design process (e.g. when trying to connect a node's output of type A to another node's input of type B). Once a correct query graph is created, the tool automatically compiles it down to valid Pig code and presents it to the user for validation.
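The check-then-compile flow described above can be sketched as follows. This is an illustrative model only, not QryGraph's actual implementation (which is written in Scala): the `Node` class, its `needs`/`emits` field sets, and the functions `check` and `compile_to_pig` are hypothetical names, and "type checking" is simplified to verifying that every field a command reads is provided by one of its inputs.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class Node:
    alias: str                            # Pig variable the result is bound to
    template: str                         # Pig command; {0}, {1}, ... are input aliases
    inputs: List[str] = field(default_factory=list)
    needs: FrozenSet[str] = frozenset()   # fields the command reads
    emits: FrozenSet[str] = frozenset()   # fields available on the output


def check(nodes: List[Node]) -> List[str]:
    """Return error messages; an empty list means the graph is well-formed.

    Nodes are expected in data-flow order (inputs before their consumers).
    """
    seen: Dict[str, Node] = {}
    errors: List[str] = []
    for node in nodes:
        available = set()
        for src in node.inputs:
            if src in seen:
                available |= seen[src].emits
            else:
                errors.append(f"{node.alias}: unknown input {src}")
        missing = node.needs - available
        if missing:
            errors.append(f"{node.alias}: missing field(s) {sorted(missing)}")
        seen[node.alias] = node
    return errors


def compile_to_pig(nodes: List[Node]) -> str:
    """Emit one Pig statement per node, refusing to compile an invalid graph."""
    errors = check(nodes)
    if errors:
        raise ValueError("; ".join(errors))
    return "\n".join(f"{n.alias} = {n.template.format(*n.inputs)};" for n in nodes)


graph = [
    Node("D", "LOAD 'cars.csv' USING PigStorage(',') AS (CAR_ID:int, SPEED:int)",
         emits=frozenset({"CAR_ID", "SPEED"})),
    Node("E", "FILTER {0} BY SPEED > 5", inputs=["D"],
         needs=frozenset({"SPEED"}), emits=frozenset({"CAR_ID", "SPEED"})),
]
print(compile_to_pig(graph))
```

Running `check` after every edit, rather than only before compilation, is what would give the immediate feedback described in the text; the sketch keeps both steps separate so either can be called on its own.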
The generated query can then be issued to a Hadoop cluster.

Apart from the graphical editor, QryGraph also provides an interface for keeping track of all created queries, for issuing them on demand or on a periodic schedule, for notifying the user when issued queries terminate, and for presenting the query results.

In the following, we detail the main design goals of QryGraph, its main usage scenarios, and its technical architecture and implementation.

A. Main Design Goals

1) Easy to use

The tool should reduce the cognitive barrier in understanding and creating complex Pig queries. For this, it needs an intuitive and easy-to-use user interface that helps the novice designer in creating a query (e.g., by offering the possible fields to filter by in a FILTER node). At the same time, it should offer full flexibility to advanced Pig users, who might need to visually inspect or even edit the generated Pig code.

The tool should also provide an ergonomic interface for managing created queries and configuring their scheduling policies.

2) Quick feedback

The tool should offer quick, preferably immediate, feedback when designing a query, so that the designer can resolve type errors early on.

The tool should also provide a "test run" of a query, i.e., execution of the query on a small set of fabricated data instead of production data. This would offer the possibility of semantic checks and validation, answering the question "Does the query actually perform what it is supposed to?"

3) Reuse support

The tool should offer the user the possibility to extract common patterns used across several queries and reuse them in new queries as "composite nodes". One such example could be a composite node that joins GPS position data from two inputs based on a user-defined distance function.

Fig. 5. QryGraph graphical editor.

Fig. 6. Subgraph of "Join by GPS position" node.
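As an illustration of the composite-node example named under design goal 3, the sketch below joins two sets of GPS positions using a pluggable distance function and then counts distinct cars per parking lot. It is plain Python rather than Pig, and every name in it (`haversine_miles`, `join_by_distance`, `cars_per_lot`, the sample coordinates) is an assumption made for this example. The haversine formula is used here because the running example relies on a haversine-based distance UDF.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]    # (latitude, longitude) in degrees
EARTH_RADIUS_MILES = 3958.8    # mean Earth radius


def haversine_miles(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in miles."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def join_by_distance(left: Dict[str, Point], right: Dict[str, Point],
                     distance: Callable[[Point, Point], float],
                     max_distance: float) -> List[Tuple[str, str]]:
    """Cross both inputs and keep the id pairs closer than max_distance."""
    pairs = []
    for left_id, left_pos in left.items():
        for right_id, right_pos in right.items():
            if distance(left_pos, right_pos) < max_distance:
                pairs.append((left_id, right_id))
    return pairs


def cars_per_lot(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct cars for each parking lot id."""
    cars: Dict[str, set] = {}
    for lot_id, car_id in pairs:
        cars.setdefault(lot_id, set()).add(car_id)
    return {lot_id: len(ids) for lot_id, ids in cars.items()}


lots = {"pl1": (48.137, 11.575)}                             # a lot in Munich
cars = {"car1": (48.140, 11.580), "car2": (52.520, 13.405)}  # one nearby, one far away
pairs = join_by_distance(lots, cars, haversine_miles, max_distance=5.0)
print(cars_per_lot(pairs))
```

Passing the distance function as a parameter mirrors the idea of a composite node with a user-defined distance: the join logic stays fixed while the notion of "near" can be swapped out.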

4) Easy to set up

The tool should be set up with minimum effort from the user. This will allow prospective users to experiment with the tool and consider contributing by extending its functionalities.

B. Usage

QryGraph offers two main functionalities to its users: query design in a graphical editor, and query administration and lifecycle management.

1) Query design

When a user wants to create a new query or edit an existing one, they use the graphical editor. The editor features a command menu on the left side and a large pane for creating the query graph on the right side (Fig. 5). Here the user can:
- select the data sources (e.g. files in CSV format) of the query and add them to the graph;
- add native Pig functions (e.g., FILTER, GROUP, JOIN) as nodes to the graph;
- edit the configuration and parameters of the nodes;
- plug nodes together by connecting input ports to output ports;
- automatically get instant feedback on possible type errors after every change;
- inspect the Pig code that is generated on the fly from an error-free graph.

2) Query administration and management

When a user needs to obtain an overview of the created queries, manage their triggering policies, and view their sample results, they use the administrative interface of QryGraph (Fig. 7). Here the user can:
- review execution statistics on a dashboard;
- list all queries the user has created;
- pause and suspend query execution;
- change the execution schedule of a query;
- check the approval status of a query (each query has to be approved at the server side in order to run on production data; see Section III.C);
- view the results of a query.

Fig. 7. Query management interface.

C. Technical Architecture and Implementation

QryGraph has been implemented as an open-source project (github.com/Starofall/QryGraph).

Fig. 8. QryGraph architecture – custom code (green) / libraries.

The tool follows a client-server architecture (Fig. 8) and is mainly written in Scala.
The client component is built with Scala.js, which compiles Scala into JavaScript. The server is built with the Play Framework; the dynamic server-client communication is done using the actor-oriented Akka library (akka.io) and Akka.js (https://github.com/unicredit/akka.js) with underlying web sockets. This connection allows the server to update the client on compilation results without using any long polling technique. Using the abstraction layer provided by Akka actors, the client and server communicate via typed messages that are serialized and sent through the socket. The message classes and data structures are shared by the server and client Scala code. This simplifies development and debugging, since a node instance has the same behavior on both the client and the server and no manual data parsing is required on either side.

In summary, the internal workings of the graphical editor are the following: Once a user edits an element of a graph, the graph object is serialized and sent over the network to the server. Then the server performs syntax checking. If the graph is syntactically correct, Pig code is generated out of it and sent back to the user for visual inspection. Once the user wants to issue the query to the Hadoop cluster, the execution request is sent to the server, where it is added to the queue of queries that need manual approval by the cluster administrator (an optional step in the process). Then, the query is issued to the Hadoop cluster. Once the results are computed, they are sent back to the user via the same actor-based communication.

IV. DISCUSSION AND WORK IN PROGRESS

QryGraph is a project under active development. In the initial stages documented in this paper, we mainly focused on implementing the programming abstractions and overall infrastructure that would speed up feature development. For instance, we now have a very robust actor-based client-server communication that allows for seamless development.

In the following, we reflect on the main architectural decisions and on the features under development.

A. Web-Based

The QryGraph client has been implemented as a web-based application. One of the primary drivers for doing so was ease of use, as users are typically familiar with browser environments. At the same time, this allows the developers to use many off-the-shelf libraries that are available for JavaScript and CSS (e.g. Twitter Bootstrap).

The web-based environment also makes the tool setup easy: it is only required to deploy a JVM-based application on a server. Then any user with a modern browser is able to start using the tool and create queries.

B. Seamless Client-Server Development

In order to leverage the domain-specific modeling capabilities and the type system of Scala, we opted for using Scala.js to generate most of the JavaScript code, in particular the parts related to the representation of queries on the client side. Using a well-known Scala library for actors, Akka, and its Scala.js counterpart, Akka.js, allows developers to work with the same actor-based abstractions on both the client and the server side. The communication is conveniently abstracted into typed messages sent between Scala actors over a web socket connection.

C. Enhanced Testing and Debugging – In Progress

In its current state, the graphical editor gives instant feedback to the user regarding syntax errors as changes are being made on the graph. This is performed at the server side by type checking (e.g. checking that connected inputs and outputs are of the same type, that GROUP operators operate on attributes given as inputs, etc.). This offers a mechanism to spot errors early on and refactor the query.

An additional mechanism (under development) is to allow sample runs of an error-free query for debugging purposes. Running a query on the production data can be a lengthy process taking minutes or hours to complete. However, for debugging purposes, a much quicker response has to be provided to the user. To this end, a test data set has to be provided against which test runs can be issued. Providing such a test data set is far from easy and typically has to be tailored to the particular user program at hand [6]. This data set has to be considerably smaller than the production data set, yet still realistic; it can be used for two types of tests:
- Fit-for-purpose validation, i.e. checking whether the query has the intended effect;
- Performance testing: when running multiple queries against the same test data set, or against test data sets of approximately the same size, the execution times of past runs can be used as an indicator of a low-performing query. Such analysis relies on the relative difference in the execution times of queries rather than on their absolute values.

D. Component System – In Progress

One of the advantages of the graphical editor is that it makes the data-flow architecture of a query explicit, which improves program comprehension and maintainability. To enable reuse of query fragments (subgraphs) that are common across different queries, we are working on a mechanism that allows the creation of components with well-defined inputs and outputs out of query subgraphs. These components would then be used as regular nodes in the graphical editor, similar to black-box component composition.

An example of such a component could be a subgraph that joins GPS position data from two inputs based on a user-defined distance function. The creation of these components will be done via the main graphical editor, as illustrated in Fig. 6.

V. RELATED WORK

Due to the recent advancements in Big Data infrastructures and tools, and in the Hadoop ecosystem in particular, many different tools and products have been proposed. We focus here on tools that offer graphical interfaces for viewing or specifying Big Data analytics jobs and are thus directly comparable to QryGraph.

The open-source project PigPen [6], [9] is an extension to the Eclipse platform that allows the user to specify Pig queries in a textual editor and then inspect the corresponding query graph, which is updated on the fly. However, editing the graph is not supported. PigPen also offers error checking and running a query on a "sandbox" data set for debugging. We intend to reuse their approach in the creation of our test data set infrastructure.

A mature open-source solution, also based on Eclipse, is Talend Open Studio for Big Data [10]. It allows the specification of Big Data analytics jobs in a graphical interface. Compared with this tool, QryGraph is more lightweight and requires less upfront effort for setting up and getting started.

When looking at closed-source solutions, Tableau Desktop [11] is a tool that offers an easy-to-use user interface for specifying and running Big Data analytics. Instead of supporting the generation of a specific query, this tool focuses on data visualization. Pentaho [12] offers several closed-source products that also help the user define a data flow using a graphical interface. They support multiple database configurations.

VI. CONCLUSIONS

In this artifact paper, we presented our ongoing implementation and long-term plan for a new tool for simplifying the creation and management of Big Data analytics jobs. QryGraph focuses on the popular scripting language Pig and offers visual representation and editing of Pig queries. It allows for rapid prototyping by on-the-fly compilation and syntax checking, and promotes reuse by allowing the specification of user-defined query subgraphs.

One of the features that we would like to include in QryGraph in the future is the possibility of importing existing Pig queries into the graphical editor. This will allow for creating a library of example queries for educational purposes and of subqueries that can be reused.

Finally, we would like to support cooperative query editing.
The existing implementation based on web sockets already provides the base upon which several users can work locally and communicate with the server with independent query updates.

ACKNOWLEDGMENT

This work is part of the TUM Living Lab Connected Mobility project and has been funded by the Bayerisches Staatsministerium für Wirtschaft und Medien, Energie und Technologie.

REFERENCES

[1] J. Needham, Disruptive Possibilities: How Big Data Changes Everything. O'Reilly Media, 2013.
[2] E. Dumbill, Planning for Big Data. O'Reilly Media, 2012.
[3] S. Srinivasa and V. Bhatnagar, Eds., Big Data Analytics, vol. 7678. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012.
[4] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Washington, DC, USA, 2010, pp. 1–10.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun. ACM - 50th Anniv. Issue, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[6] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-so-foreign Language for Data Processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2008, pp. 1099–1110.
[7] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, "Building a High-level Dataflow System on Top of Map-Reduce: The Pig Experience," Proc. VLDB Endow., vol. 2, no. 2, pp. 1414–1425, Aug. 2009.
[8] "Hortonworks," 01-May-2016. [Online]. Available: http://hortonworks.com/.
[9] "PigPen Wiki," 01-May-2016. [Online]. Available: https://wiki.apache.org/pig/PigPen.
[10] "Talend Open Studio for Big Data," 01-May-2016. [Online]. Available: ...en-studio.
[11] "Tableau Desktop," 01-May-2016. [Online]. Available: ...sktop.
[12] "Pentaho," 01-May-2016. [Online]. Available: ...gration.
