Transcription

GRainDB: A Relational-core Graph-Relational DBMSGuodong JinNafisa AnzumSemih [email protected] University of [email protected] of ty of WaterlooCanadaABSTRACTEver since the birth of our field, RDBMSs and several classes ofgraph database management systems (GDBMSs) have existed sideby side, providing a set of complementary features in data models,query languages, and visualization capabilities these data modelsprovide. As a result, RDBMSs and GDBMSs appeal to differentusers for developing different sets of applications and there is immense value in extending RDBMSs to provide some capabilities ofGDBMSs. We demonstrate GRainDB, a new system that extendsthe DuckDB RDBMS to provide graph modeling, querying, andvisualization capabilities. In addition, GRainDB modifies the internals of DuckDB to provide a set of fast join capabilities, such aspredefined pointer-based joins that use system-level record IDs(RID) and adjacency list-like RID indices, to make DuckDB moreefficient on graph workloads.1INTRODUCTIONEver since the birth of database management systems, relationsand graphs have been the core data structures to model applicationdata in two broad classes of DBMSs: RDBMSs and those referred toas graph database management systems (GDBMSs). Historically, theterm GDBMS has been used to refer to several classes of DBMSsthat all adopt a graph-based model, such as the CODASYL model,RDF, and most recently the property graph model, which is adoptedby contemporary systems such as Neo4j [5], TigerGraph [7], GraphflowDB [20], or AvantGraph [1].RDBMSs and GDBMSs have complementary capabilities thatmay appeal to users. For example, unlike graphs, the relationalmodel is not restricted to binary relationships and easily modelsn-ary relationships between entities. At the same time, the relational model requires strict schematization of the data, while graphmodels are often semi-structured and provide more flexibility whenstoring sparse data. Similarly, while SQL is very suitable to expressstandard data analytics tasks, it is arguably not as suitable whenexpressing queries with recursive joins, for which the languagesof GDBMSs contain specialized syntaxes. In terms of performance,RDBMSs integrate numerous techniques, such as columnar storageand vectorized processing, yet, still have shortcomings on graphworkloads that contain many recursive and many-to-many joins,for which GDBMSs employ specialized techniques.RDBMSs have seen wider adoption than GDBMSs in practiceand emerged as the de-facto systems to store, manage, and queryapplication data. However, given the complementary capabilitiesof GDBMSs over RDBMSs, there are economical and technicalThis paper is published under the Creative Commons Attribution 4.0 International(CC-BY 4.0) license. Authors reserve their rights to disseminate the work on theirpersonal and corporate Web sites with the appropriate attribution, provided that youattribute the original work to the authors and CIDR 2022. 12th Annual Conference onInnovative Data Systems Research (CIDR ’22). January 9-12, 2022, Chaminade, USA.advantages for extending RDBMSs to natively provide some ofthe capabilities of GDBMSs and support efficient graph querying.Over the past two years, we have started to develop a relationalcore hybrid graph-relational system that we call GRainDB at theUniversity of Waterloo. We use the term relational-core to indicatethat GRainDB extends an RDBMS at its core. Specifically, GRainDBintegrates a set of storage and query processing techniques, suchas predefined pointer-based joins (reviewed in Section 4.1), intothe columnar DuckDB RDBMS [2, 24] to make it more efficient ongraph workloads. In addition, GRainDB extends DuckDB to nativelysupport a set of new features: Hybrid graph-relational data modeling: Users can model partsof their database also as a graph. Specifically, users can modelrelations as nodes and joins between these relations as edges. Hybrid graph-relational querying: Users can query tables andgraphs seamlessly using an extended SQL that we call GRQL.GRQL has drawn from TigerGraph’s GSQL [13] and Oracle’sPGQL languages [28]. The FROM clause of GRQL can containpath patterns in addition to tables to express joins. Path patternsare described with a node and arrow syntax and can containspecial syntax for recursive joins, e.g., the Kleene star. Graph visualization: We have built a browser frontend, similarto the frontends of GDBMSs, such as Neo4j’s Bloom [6], thatsupport node-link visualization and interactive exploration ofthe part of the database that is modeled as a graph.Our work is similar in spirit to graph systems built on top of somecommercial RDBMSs, such as IBM Db2 [27] and SAP Hana [25].For example, the Db2 Graph project [27] implements the Gremlingraph query language [4] on top of IBM Db2. Unlike GRainDB,these layered systems support only querying graphs and do notmodify the core query processors of the underlying RDBMS.In this paper, we describe our vision, progress, and ongoingwork on GRainDB, and a demonstration of developing a COVID-19contact tracing application on GRainDB. Our application aims todemonstrate: (i) the advantages of modeling and querying dataseamlessly both as tables and graphs and the ability to visualizeparts of the database as a graph; and (ii) the storage and processing techniques we have integrated into DuckDB to make it moreefficient on graph workloads. The source code of GrainDB can befound here [3].2DEMONSTRATION DATABASEWe begin by describing a relational database that we use as ourrunning example. We consider a COVID-19 contact tracing application in the city of Waterloo, Ontario in Canada. Figure 1 shows theschema of the database used by the application: Person contains data about people who took COVID-19 tests. Contact contains the close contact relationship between twopeople in the Person table, one of which has tested positive.

Query Languages: SQL has emerged as the de facto language toquery and manipulate data in practice. While SQL is very suitableto express many standard data preparation and analytics tasks,languages of GDBMSs contain syntax that can more easily expressrecursive joins. Consider asking a variable-length join query askingfor people that Mahinda may have contacted through 1 to 4 degreesof contacts. Consider the the alternative graph model of our exampledatabase. For example, the Cypher language of Neo4j contains aspecial syntax for such recursive queries, which can be simplerthan writing those queries in SQL:MATCH (a:vPerson)-[:eContacted*1.4]- (b:vPerson)WHERE a.name MahindaFigure 1: Example relational database schema. Top row is thegraph schema modeling part of the database also as a graph.Similarly, languages of GDBMSs also contain specialized syntax forexpressing some queries that are unions of multiple join queries,such as queries with unlabeled nodes and edges.Visualization: The output tables of queries in RDBMSs can beused to produce many useful visualizations, such as bar charts orline charts. However, these visualizations are not suitable whenone wants to analyze the sequences of connections and paths thatconnect the entities in a database. Default frontends of GDBMSssupport node-link views that are more suitable for this purpose.Our first goal in GRainDB is to extend an RDBMS to complementit with the above features of GDBMSs. One can in principle providethese features only by building layers on top of the underlyingRDBMS without modifying the internals of the system. This approach however would be sub-optimal in performance. Our secondgoal in GRainDB is to modify the underlying RDBMS to make itmore efficient on graph workloads. This is a colloquial term to referto workloads that contain many equality joins over many-to-manyrelationships that can frequently be cyclic and recursive.Techniques for Fast Join Capabilities: We aim to integrate threetechniques into the underlying RDBMS, the first of which is motivated by how existing GDBMSs often perform joins. The lattertwo are relatively recent techniques developed in the context ofRDBMSs for many-to-many joins but not yet seen wide adoption. Predefined pointer-based joins: Joins in GDBMSs are pointer-basedin nature and predefined to the system as edges1 . Node recordsare joined with their neighbors using system-level integer IDs,which serve as pointers to look up matching neighbor IDs in anadjacency list index. This contrasts with value-based nature ofjoins in RDBMSs, where tables are often joined with each otherusing the values inside columns. Value-based joins are naturallymore general as users can join arbitrary tables. Yet they canbe less efficient because they typically do not use an adjacencylist-like join index and may be on non-integer types, e.g., strings. Factorization: Factorization [22, 23] is a technique to avoid datareplication in results of queries that contain joins over manyto-many relationships. As a simple example, consider joining aperson 𝑝 1 with its 𝑘 contacts 𝑝 1,1 , ., 𝑝 1,𝑘 , which would result in𝑘 tuples (𝑝 1, 𝑝 1,1 ), ., (𝑝 1, 𝑝 1,𝑘 ) that replicate the value 𝑝 1 𝑘 times.Such replication becomes severe when queries contain furthermany-to-many joins. Factorization avoids such replication by Place contains information about places, such as universitiesand restaurants, that record their visitors for contact tracing. Visit stores the records, collected by the places in the Placetable, of who visited each place and when. Zipcode contains the set of possible zipcodes in Waterloo and isused to normalize addresses in the Person and Place tables.Throughout the paper, we discuss modeling a part of this database also as a graph as shown in the top row of Figure 1. Wemodel the Person and Place tables as nodes with labels vPersonand vPlace, respectively. We then model the join between a personrecord 𝑝𝑒 and a place record 𝑝𝑙 that 𝑝𝑒 visited (over the Visit table)as eVisit edges and the join between two people who contactedeach other (over the Contact table) as eContact edges.3THE CASE FOR GRAINDBWe next motivate GRainDB by discussing the complementary features of RDBMSs and GDBMSs in terms of their data models, querylanguages, and visualization capabilities that their data modelsprovide. We then discuss several performance features we plan tointegrate into DuckDB to make it more efficient on graph workloads,which completes the description of our vision for GRainDB.Data Model: Although both relations and graphs can model manyapplication data, we identify two advantages of relations: N-ary relationships: Graphs are restricted to represent binaryrelations between two nodes, while relations can represent n-aryrelationships for arbitrary values of 𝑛. Normalized data: Relations are often arguably the natural structures to model normalized data. For example, the Zipcode tablein our example normalizes the zipcodes in Person and Placetables and may not be naturally thought of as nodes or edges.Similarly, we identify two advantages of graph-based models: Closeness to entity-relationship model: As discussed in a user survey conducted in our group [26], some users of GDBMSs findnodes and edges closer to their mental model of entities andrelationships in their data than tables. Semi-structured data: Graph-based models are often semi-structured, allowing arbitrary properties to be stored on nodes and edgeswithout a strict schema definition, which may have advantagesin applications that frequently integrate data from new sources.1 The term predefined was used by Ted Codd in his Turing Lecture to describe the joinsin GDBMSs of 1960s and 1970s that were based on the CODASYL model [12]2

that 𝑟 𝑓 points to. RIDs are dense integer-based system-level IDs incolumnar RDBMS that are used to identify the physical locationsof the column values of each row. They are therefore system-levelpointers, similar to node IDs in GDBMSs.RID columns are used during query processing as follows. Ifa query contains a predefined join, say between Person.id Contact.p1id, we replace some of the hash join and scan operators in DuckDB’s default plan as follows. We replace the hash joinoperator evaluating Person.id Contact.p1id predicate with amodified hash join operator that we call SIPJoin, for sidewaysinformation passing join. We also replace the scan operators forPerson and Contact tables with modified scan operators (explainedmomentarily), that scan RID values into tuples. Given the scannedRID values, SIPJoin performs the join on integer RIDs instead ofthe original id and p1id columns. In addition, SIPJoin can passmatching RIDs from its build side in a bitmask that can be used bythe scan operator (called ScanSJ, for scan semijoin) on its probeside to perform a semijoin when scanning tuples. This ensures thatonly the tuples that are guaranteed to successfully join are scanned,which can decrease the scans and probes significantly. Indexing Sub-system: A common way to represent many-tomany relationships between two sets of entities in relational databases is to have a table 𝐶 that contains two foreign keys on two other(not necessarily different) tables 𝑃1 and 𝑃2 , as in the Contact tablein our running example. If the join of 𝐶 with both 𝑃1 and 𝑃2 havebeen predefined to the system, users can additionally build a RIDindex on 𝐶 on the two extended RID columns 𝑅𝐼𝐷𝑝1 and 𝑅𝐼 𝐷𝑝2(using a BUILD RID INDEX command). This index is analogous toadjacency list indexes in GDBMSs and is stored in an adjacency listformat. RID indexes can be used by SIPJoin to generate furtherinformation to pass when a query joins 𝑃 1 or 𝑃2 with 𝐶.Figure 2: GRainDB Overview. Features marked as are ourongoing and future works.representing results as unions of Cartesian products, e.g., as(𝑝1 {𝑝 1,1, ., 𝑝 1,𝑘 }). Worst-case Optimal (wco) Join Algorithms: While factorization isgeared toward queries with acyclic joins, the recent worst-caseoptimal join algorithms are geared for cyclic join queries overmany-to-many joins. The seminal work of Atserias et al. [10]observed that traditional binary join plans are suboptimal oncyclic join queries, which was corrected by the recent wco joinalgorithms [21, 29]. Unlike binary join plans that perform joinspairs of tables-at-a-time, wco join algorithms join multiple tablesone column-at-a-time.4GRAINDB OVERVIEWFigure 2 shows the overview of GRainDB. We are developing GRainDBas an extension of DuckDB, which is a new open-source columnar read-optimized RDBMS that implements several modern queryprocessing techniques, such as columnar storage [8], vectorizedoriented query processing [8], and morsel-driven parallelism [18].We opted for an analytical read-optimized RDBMS as one of ourgoals is to efficiently support graph workloads, which are readheavy. GRainDB extends DuckDB both internally and through several new components, which we next review, highlighting our prioras well as ongoing work.4.1Example 1. Figure 3 shows example instances of the Personand Contact tables when two joins are predefined: Person.id Contact.p1id and Person.id Contact.p2id. This results in anextended Contact table that contains two RID columns, R1 and R2,that contain the RIDs of the person tuples that match p1id and p2idcolumns, respectively. For example the second tuple 𝑡 2 in Contacthas R1 value 2 because 𝑡 2 .p1id references 303 (Mahinda’s ID), whichhas an (implicit) RID value 2 in the Person table. Now consider thefollowing SQL query:SELECT P2.name FROM Person P1, Contact C, Person P2WHERE P1.id C.p1id AND P2.id C.p2idAND P1.name 'Mahinda'Prior Work: Predefined Pointer-based JoinsUsing Sideways Information Passing (SIP)Our first work on GRainDB [16] focused on extending the internalsof DuckDB to support predefined pointer-based joins to improvethe systems’ performance on many-to-many joins. We briefly review our technique here and refer readers to reference [16] for thedetails. We integrated predefined joins into DuckDB by extendingtwo components of the system: (i) the physical storage and queryprocessor; and (ii) the indexing sub-system. Physical Storage and Query Processor: Users first predefine aprimary-foreign key join from table 𝐹 to table 𝑃, where a columnof 𝐹 has a foreign key to a column of 𝑃, using a PREDEFINE JOINclause we added to DuckDB’s SQL dialect. This performs an ALTERTABLE command that inserts an additional 𝑅𝐼𝐷𝑝 column to 𝐹 thatcontains for each row 𝑟 𝑓 in 𝐹 the row ID (RID) of row 𝑟 𝑝 in 𝑃This query asks for the name of the first-degree contacts of Mahinda.Figure 3c is an example plan in GRainDB for this query using SIPJoinand ScanSJ operators. For example, the bottom SIPJoin receives theRID of the tuple 𝑡 2 from Person where 𝑡2.𝑅1 is 2 (corresponding toMahinda’s RID). Using the RID index, the RID of the only matchingtuple from Contact is generated (with RID 1) and passed to the probeside ScanSJ as a bitmask. ScanSJ then performs a semijoin and scansonly the tuples whose RIDs are in the bitmask.Our prior work [16] presents extensive experiments demonstrating the performance benefits of enhancing DuckDB with predefinedjoins on both graph and relational workloads with many-to-manyjoins, e.g., the LDBC social network benchmark [9]. Predefined3

(a) Example tables.(b) Example RID index.(c) Example join plan.Figure 3: (a) Example instances of Person and Contact tables. The Contact table contains the two RID columns, R1 and R2 whichare materialized after two joins to the Person table predefined on p1id and p2id columns, respectively. (b) The RID index onContact. (c) An example join plan for the query in Example 1 that uses SIPJoin and ScanSJ operators.R𝑉 and R𝑊 , respectively, in two possible ways: (i) as a direct primarykey-foreign key join between R𝑉 and R𝑊 ; or (ii) as two primary keyforeign key joins, one from R𝑉 to a “relationship” table E and theother from R𝑊 to E. For example, the following command definesan eContact edge label from vPerson to vPerson nodes throughtwo joins to the Contact table:joins is purely a relational query processing optimization that isindependent of GRainDB’s graph modeling and querying capabilities. That said, as we next discuss, when users model parts oftheir database as a graph, we automatically predefine the joinsimplied by the modeled edges and build the necessary RID indexesin DuckDB. For example, in our running example, of which tablesare shown in Figure 3a, users map the table Person to a node label vPerson, and define an edge label eContact from vPerson tovPerson. The edge label is mapped from the table Contact throughtwo equality joins between Person.ID and Contact.p1id, and between Person.ID and Contact.p2id, respectively. Automatically,GRainDB predefines these two joins, which adds two RID columnsR1 and R2 to Contact table for p1id and p2id, respectively. Additionally, GRainDB creates RID indexes on these two extended RIDcolumns (R1 and R2).After the modeling, given a query, we perform a rule-basedquery optimization based on DuckDB’s query plan. Specifically, werecursively traverse DuckDB’s default logical plan and find eachjoin operator that evaluates a predefined join. Upon finding thesejoins, we replace them with our SIPJoin operator, and also replacecorresponding probe-side Scan operators, to which bitmasks arepassed, with ScanSJ.4.2DEFINE EDGE LABEL eContact FROM vPerson V TO vPerson WON V.id Contact.p1id AND W.id Contact.p2idEdge definition commands predefine the joins in the edge definitionto DuckDB and build the necessary RID indexes. We store thedefined vertices and edges in DuckDB’s catalog and use it whencompiling path patterns to query plans, as we next explain.Path patterns: Path patterns visually describe joins between thetables that were defined in node and edge definitions using nodeand arrow syntax. Our syntax draws directly from existing languages of GDBMSs, such as Cypher, GSQL, or PGQL. Path patterns can include syntax to describe fixed-length joins, such as(a:vPerson)-[:eContact]- (b:vPerson), or two types of recursive joins: (i) variable-length joins that provide a minimum andmaximum lengths on the joins/paths; or (ii) transitive closure ofa join using the Kleene star syntax. For example, in the pattern(a:vPerson)-[:eContact*2.5]- (b:vPerson), variable 𝑏 denotes 2 to 5 degree contacts of nodes matching variable 𝑎.Path patterns can appear in the FROM clause to form queriesthat seamlessly query graphs and relations.Graph Modeling, Querying, andVisualizationAs part of this demonstration paper, we have extended GRainDBwith two components to provide graph modeling, querying, andvisualization capabilities: (i) GRQL; and (ii) a browser frontend.Example 2. Consider a query that asks for the top 50 highest riskpeople 𝑏 who have the most pathways to someone 𝑎 infected withthe virus, where 𝑎 lives in one of the zipcodes in high risk regions ofWaterloo. We define a pathway as a 1 to 3-length contact path from 𝑎to 𝑏. This query can be expressed in GRQL as follows:4.2.1 GRQL. We added further extensions to SQL to define: (i)nodes and edges; and (ii) path patterns that can seamlessly existin the FROM clause with tables to describe joins over the definednodes and edges. We refer to this extended SQL as GRQL.Node and edge label definitions: We added two commands: DEFINE NODE LABEL and DEFINE EDGE LABEL, that form our relationto-graph transformation language. The former simply maps anyrelation in the database to a node label, e.g., the Person table to avPerson node label. The latter defines an edge between any two(not necessarily distinct) node labels V and W, say mapping relationsSELECT b.name, count(*) as numPathwaysFROM (a:vPerson)-[:eContact*1.3]- (b:vPerson), zipcodeWHERE a.test result ' ' AND a.zipcode zipcode.code ANDzipcode.region IN HighRiskZipcodesORDER BY numPathways DESC LIMIT 50The omitted HighRiskZipcodes is an inner (pure) SQL query thatcomputes the highest risk zipcodes based on the number of infections4

in that zipcode. The query seamlessly joins a path pattern with theZipcode table.to develop a query processor that is as general as FDB but usesDuckDB’s vectorized and pipelined processor.WCO Joins: Several works have implemented wco join algorithmsin the context of both RDBMSs and GDBMSs. In the context ofGraphflowDB, we have proposed techniques that assume the existence of presorted indices and and use fast merge-sort like joinoperators [17, 19, 20]. This has a performance advantage duringquery evaluation but slows down updates. Reference [14] insteadintegrates wco joins in the Umbra system using hash-indexes thatare built on the fly. As future work, we plan to integrate WCO joinsinto DuckDB’s predefined joins using an approach that sorts RIDindices on the fly and uses merge-sort like operators and study thetradeoffs of this design compared to prior approaches.Semi-structured Sparse Data Another challenge is how to integrate support for semi-structured attributes in DuckDB. A simplebut not performant approach would be to add a default column toeach table that stores a blob that encodes unstructured propertiesand parse this blob during query processing. Whether this capability can be provided more efficiently and without making the systemvery complex is an interesting future direction.To compile queries with path patterns, we modified DuckDB’sparser to generate an ast that recognizes path patterns. In addition,we modified DuckDB’s planner, which takes the output ast of theparser and generates an initial logical plan, to rewrite fixed-lengthpaths as default join trees and variable-length paths as recursivecommon table expressions (CTEs). We did not modify the rest ofthe pipeline: as before DuckDB generates a default optimized planfrom the initial plan and if any part of the plan can benefit frompredefined joins, we replace necessary hash join and scans (possiblyinside the subplans evaluating recursive CTEs) with SIPJoin andScanSJ operators.4.2.2 GRainDB Browser Frontend. We extended GRainDB with abrowser frontend, which contains two visual components: (1) aninteractive Graph Schema Designer (GSD); and (2) an interactiveGraph Visualization Panel (GVP), shown in Figure 4. GSD allowsusers to model their graphs visually and interactively instead oftyping our node and edge definition commands. On the left panelof GSD users have access to the schema of their relational database.Users drag and drop any table’s schema to define a node label. Usersthen draw edges between the defined node labels to define edges.The joins defining the edges are described by selecting join columnsand the edge table (if any) in drop down menus.GVP is a frontend to ask queries in GRQL to GRainDB andvisualize the results both as tables and as graphs when possible.Specifically, if projections in the SELECT clause contain variablesthat were used to describe nodes and edges in path patterns, GVPcan visualize these outputs in an interactive node link diagram, asshown in Figure 4b. Users can click on nodes to expand the nodes totheir neighborhoods, which are obtained by further GRQL queriesto GRainDB.4.35DEMONSTRATION SCENARIOSOur demonstration illustrates three aspects of GRainDB: (i) the easeof development of an application using GRainDB’s hybrid graphrelational modeling capability and GRQL language; (ii) the benefitsof enhancing an RDBMS with graph visualization capabilities; and(iii) GRainDB’s query plans that perform our SIP- and semijoinbased evaluation of predefined joins.GRainDB Application Development: Our first scenario walksattendees through the steps of developing a reporting applicationfor Waterloo Public Health Services (WPHS) using GRainDB. Theapplication runs routine queries at the end of each day to generatedaily reports for managing the pandemic at WPHS. The queriesgenerating the reports will include those that are easily written inSQL, as well as those that can benefit from GRQL’s path patterns.Using the Graph Schema Designer, the attendees will first visually model the Person and Place relations as nodes and thendefine the eContact and eVisit edges between them (without going into the details of the predefined joins and created RID indicesin the underlying DuckDB). We will then go to the query panel inGRainDB’s browser frontend to show attendees the queries writtenboth in pure SQL and GRQL to highlight the differences and easeof using specialized syntax for graph patterns. For example, onereport will generate the list of high risk individuals based the number of pathways they may have to infected people. We will showthat this can easily be done with the GRQL query in Example 2that seamlessly joins recursive path patterns with tables and usesselections on inner SQL queries.GRainDB Graph Visualization: Our next scenario demonstratesthe benefits of visualizing parts of a relational database in the formof a graph. Our setup will be a contact tracing agent at WPHSwho has been given a report on an individual Mahinda who hastested positive but reported no one else during their interviewwho was positive and has not visited a place recently. The agentwas asked to analyze how the virus may have spread to Mahinda.The agent uses the graph visualization capability of GRainDB’sGVP. On the frontend we will ask a simple GRQL query to selectOngoing and Future WorkOur implementation of predefined joins inside DuckDB and GRQLand our browser frontend partially fulfills our goal of developingan extended RDBMS that is highly performant on graph workloadsand one that provides a set of graph modeling, querying, and visualization capabilities to users. We next describe our ongoing andfuture work.Factorization: In ongoing work we are modifying DuckDB’s queryprocessor to process queries in a factorized format instead of blocksof flat tuples. Several works have proposed approaches to integratefactorization into query processors both in the context of RDBMSsand GDBMSs. The FDB system [11] describes a relational queryprocessor that consists of operators that take as input tries andoutput tries. This architecture materializes full outputs and is inconflict with DuckDB’s vectorized and pipelined architecture. In arecent work done in our group, we described an implementation offactorized query processor, called list-based processor in the contextof the GraphflowDB GDBMS [15], that is also developed in ourgroup. Although this is a pipelined processor that uses operatorssimilar to DuckDB’s, it is designed assuming an in-memory systemand implements a limited form of factorization where only thelast edges of paths can be factorized. We are investigating ways5

(a) GRainDB Schema Designer (GSD).(b) GRainDB Visualization Panel (GVP).Figure 4: GRainDB Browser Frontend.Mahinda’s record as a node as in Figure 4b. This displays Mahinda’srecord as a node and immediately gives the attendees a graph viewon the database. The attendees will click on Mahinda’s node tosee their direct contacts, Carmen and Tamer, neither of whom areinfected (and no place nodes, since Mahinda has not visited placesrecently). Attendees will expand Carmen and Tamer to see theirneighborhoods. When expanding Tamer’s neighborhood, they willnotice that Tamer has recently visited a night club, which uponfurther expansion will reveal many infected visitors. Through thisexploratory graph analysis (but performed on top of an RDBMS),the attendees will formulate the hypothesis that Tamer may be anasymptomatic person who has spread the virus to Mahinda.GRainDB’s Predefined Joins: The goal of our last scenario isto demonstrate GRainDB’s query plans that implement sip-basedpredefined joins. We first describe attendees how GRainDB predefines the implicit joins defined for eContact and eVisit edges andbuilds the RID indices in the underlying DuckDB. The attendees willthen issue and profile queries that contain these predefined joinsby turning our predefined join optimization on and off througha flag we

Figure 1: Example relational database schema. Top row is the graph schema modeling part of the database also as a graph. Place contains information about places, such as universities and restaurants, that record their visitors for contact tracing. Visit