Green-Marl: A DSL for Easy and EfficientGraph AnalysisSungpack Hong*, Hassan Chafi* , Eric Sedlar ,and Kunle Olukotun**Pervasive Parallelism Lab, Stanford University Oracle Labs

Graph Analysis Classic graphs; New applications Artificial Intelligence, Computational Biology, SNS apps: Linkedin, Facebook, Graph Analysis: a process ofExample Movie Databasedrawing out further informationfrom the given graph data-set“What would be the avg. hop-distancebetween any two (Australian) vatar“Is he a central figure in the movienetwork? How much?”Kevin BaconSigourneyWeaver,,Aliens“Do these actors work togethermore frequently than others?”Ben StillerJack BlackOwen Wilson

More formally , Graph Data-Set Graph G (V,E): Arbitrary relationship (E) betweendata entities (V) Property P: any extra data associated with each vertexor edge of graph G (e.g. name of the person) Your Data-Set (G, Π) (G, P1, P2, )Graph analysis on (G, Π) Compute a scalar value Compute a (new) property e.g. Avg-distance, conductance, eigen-value, e.g. (Max) Flow, betweenness centrality, page-rank, Identify a specific subset of G: e.g. Minimum spanning tree, connected component, communitystructure detection,

The Performance Issue Traditional single-core machines showed limitedperformance for graph analysis problems A lot of random memory accesses data does not fitin cache Performance is bound to memory latency Conventional hardware (e.g. floating point units) doesnot help muchUse parallelism to accelerate graph analysis Plenty of data-parallelism in large graph instances Performance now depends on memory bandwidth, notlatency. Exploit modern parallel computers: Multi-core CPU,GPU, Cray XMT, Cluster, .

New Issue:Implementation Overhead It is challenging to implement a graphalgorithm correctly and efficiently while applying parallelism differently for each execution environmentAre we really expecting a single (averagelevel) programmer to do all of the above?

Our approach: DSL We design a domain specific language (DSL) for graph analysis The user writes his/her algorithm concisely with our DSL The compiler translates it into the target language (e.g. parallelC or CUDA)(1) Inherent data-parallelismIntuitiveDescription of agraph algorithmForeach (t: G.Nodes)t.sigma ,DSL(2) Good impl. templates(3) High-level optimizationEfficient (parallel)Implementation ofthe given algorithm,,For(i 0;i G.numNodes();i ) {fetch and add(G.nodes[i], ,)EdgesetForeachBFSDSLCompilerTarget Language(e.g. C )Source-to-Source Translation

Example: Betweenness Centrality Betweenness Centrality (BC) A measure that tells how ‘central’a node is in the graph Used in social network analysis Definition Low BCHow many shortest paths arethere between any two nodesgoing through this node.High BCKevinBaconAyush K.Kehdekar[Image source; Wikipedia]

Init BC for every nodeBetweenness CentralityExample:and begin outer-loop (s)[Brandes 2001]LookscomplexsQueues, Lists,Stack,Is thisparallelizable?wBFSOrderwvCompute sigma from parentssReverseBFSOrdervwwwCompute delta from childrenAccumulate delta into BC

Example: Betweenness Centrality[Brandes 2001]swBFSOrderwvCompute sigma from parentssReverseBFSOrdervwwwCompute delta from children

Example: Betweenness Centrality[Brandes 2001]swParallel IterationBFSOrderwvCompute sigma from ervwwwCompute delta from childrenReduction

DSL Approach: Benefits Three benefits ProductivityPortabilityPerformance

Productivity Benefits A common limiting resource in software development your brain power (i.e. how long can you focus?)A C implementationof BC from SNAP ( aparallel graph libraryfrom GT): 400 line of codes (withOpenMP)Vs. Green-Marl* LOC: 24*Green-Marl (그린 말) meansDepicted Language in Korean

Productivity Benefits ProcedureManualLOCGreen-MarlLOCSourceMiscBC 40024SNAPC openMPVertex Cover7121SNAPC openMPConductance4210SNAPC openMPPage Rank7515http:// .C single threadSCC6515http:// .Java single threadIt is more than LOC Focusing on the algorithm, not its implementation More intuitive, less error-prone Rapidly explore many different algorithms

Portability Benefits (On-going work) Multiple compiler targetsDSLDescription DSLCompilerSMP back-end(Parallelized)C CUDA forGPUCodes forClusterLIB (& RT)LIB (& RT)LIB (& RT)Cluster back-end (*) Command lineargumentFor large instancesWe generate codes that work on Pregel API [Malewiczet al. SIGMOD 2010]GPU back-end (*) For small instances We know some tricks [Hong et al. PPOPP 2011]

Performance BenefitsBack-end specificoptimizationGreen-Marl CodeTarget Arch.(SMP? GPU?Distributed?)Optimized data structure& Code templateThreading Lib,(e.g.OpenMP)Graph Data StructureCompilerParsing &CheckingArch.IndependentOptUse GenerationTarget Code(e.g. C )

Arch-Indep-Opt: Loop FusionForeach(t: G.Nodes)t.A t.C 1;Foreach(s: G.Nodes)s.B s.A s.C;LoopFusionForeach(t: G.Nodes) {t.A t.C 1;t.B t.A t.C;}“set” of nodes(elems are unique)C compiler cannot mergeloops(Independence notgauranteed)Map Node, int A, B, C;List Node & Nodes G.getNodes();List Node ::iterator t, s;for(t Nodes.begin(); t ! Nodes.end(); t )A[*t] C[*t];for(s Nodes.begin(); s ! Nodes.end(); s )B[*s] A[*s] C[*s];Optimization enabled by high-level(semantic) information

Adding 1 to for allOutgoing Neighbors,if my B value ispositiveArch-Indep-Opt: Flipping Edges Graph-Specific OptimizationForeach(t: G.Nodes)Foreach(s: t.InNbrs)(s.B 0)t.A 1;Foreach(t: G.Nodes)(t.B 0)Foreach(s: t.OutNbrs)s.A 1;ssttsCounting number ofIncoming Neighborswhose B value is positives(Why?) Reverse edges may not beavailable or expensive to computeOptimization using domain-specificproperty

Arch-Dep-Opt : Selective Parallelization Flattens nested parallelism with a heuristicCompiler choosesForeach(t: G.Nodes) {parallel region,Foreach(s: G.Nodes)(s.X t.Y) {heuristicallyForeach(r: s.Nbrs) {For (t: G.Nodes) {s.A r.B;Foreach(s: G.Nodes)(s.X t.Y) {}For (r: s.Nbrs) {t.C * s.A;Three levels ofs.A r.B;}nestedparallelism[Why?]}val min t.C reductions Graph is larget.C * s.A;} # core is small.}For (t: G.Nodes) { There isval min t.CForeach(s: G.Nodes)(s.X t.Y) {overhead for}For (r: s.Nbrs) {parallelizations.A s.A r.B;}t.C * s.A;Reductions became}normal read & writeval (t.C val) ? t.C : val;}Optimization enabled by botharchitectural and domain knowledge

Code-Gen: Optimization enabled by code analysis(i.e. no BFS library could do thisSavingDownNbrs in BFSautomatically)Prepare data structure for reverse BFS traversalduringcodeGeneratedsaves edges to theforward traversal, only if required.InBFS(t: G.Nodes From s) { }InRBFS {Foreach (s: t.DownNbrs) }Compiler detects thatdown-nbrs are used inreverse traversalGenerated code caniterate only edges todown-nbrs duringreverse traversal// Preperation of down-nbrs duringBFS forward traversal.// Forward BFS (generated){ // k is an out-edge of sfor(k )node t child get node(k);if (is not visited(child)) { ;// normal BFS code hereedge bfs child[k] true;} } }// Reverse BFS (generated){ // k is an out-edge of sfor(k ) {if (!edge bfs child[k]) continue; } }

Code-Gen: Code Templates Data Structure Graph: similar to a conventional graph library Collections: custom implementationCode Generation Template BFS Hong et al. PACT 2011 (for CPU and GPU) Better implementations coming; can be adaptedtransparently DFS Inherently sequentialCompiler takes any benefits that a (template)library would give, as well

Experimental Results Betweenness Centrality Implementation(1) [Bader and Madduri ICPP 2006](2) [Madduri et al. IPDPS 2009] Apply some new optimizations Performance improved over (1) x2.3 on Cray XMT Parallel implementation available in SNAP library basedon (1) not (2) (for x86) Our Experiment Start from DSL description (as shown previously)Let the compiler apply the optimizations in (2),automatically.

Parallel performancedifferenceExperimental ResultsEffects of other optimizations Flipping Edges Saving BFS childrenNehalem (8 cores x 2HT), 32M nodes, 256M edges (two different synthetic graphs)Shows speed up overBaseline: SNAP(single thread)Better single thread performance:(1) Efficient BFS code(2) No unnecessary locks

Other ResultsConductancePerf similar tomanual impl. Loop Fusion PrivitizationVertex Cover Test and Test-set PrivitizationOriginal code data race;Naïve correction(omp critical) serialization

parallelization as much as exposedOtherAutomaticResultsdata parallelism (i.e. there is no black magic)PageRankCompare against Seq. ImplStronglyConnectedComponentDFS BFS:Max Speed-up is 2(Amdahl's Law)

Conclusion Green-Marl A DSL designed for graph analysisThree benefits ProductivityPerformancePortability (soon) Project page: marl.html GitHub repository:

We design a domain specific language (DSL) for graph analysis The user writes his/her algorithm concisely with our DSL The compiler translates it into the target language (e.g. parallel C or CUDA) (1) Inherent data -parallelism (2) Good impl. templates Efficient (parallel) Implementation of the given algorithm For(i 0;i G.numN odes();i )