
Management of Probabilistic Data: Foundations and Challenges*

Nilesh Dalvi, Dan Suciu
University of Washington, Seattle, WA
[email protected]

ABSTRACT

Many applications today need to manage large data sets with uncertainties. In this paper we describe the foundations of managing data where the uncertainties are quantified as probabilities. We review the basic definitions of the probabilistic data model, present some fundamental theoretical results for query evaluation on probabilistic databases, and discuss several challenges, open problems, and research directions.

Categories and Subject Descriptors: F.4.1 [Mathematical Logic]; G.3 [Probability and Statistics]; H.2.5 [Heterogeneous Databases]

General Terms: Algorithms, Management, Theory

Keywords: probabilistic databases, query processing

* This research was supported in part by Suciu's NSF CAREER grant IIS-0092955, NSF grants IIS-0415193, IIS-0627585, IIS-0513877, IIS-0428168, and a gift from Microsoft.

1. UNCERTAINTIES IN THE FUTURE

We know well how to manage data that is deterministic. Databases were invented to support applications like banking, payroll, accounting, inventory, all of which require a precise semantics of the data. In the future we need to learn how to manage data that is imprecise, or uncertain, and that contains an explicit representation of the uncertainty. While data management has gone through several "paradigm shifts" in the past (NF2, OO databases, semistructured data), this one is arguably more fundamental: the foundations of the new paradigm no longer lie solely in logic and finite model theory, but also in probabilistic inference [17], probability logic [2], and degrees of belief [8].

One can find evidence in many areas that such a paradigm shift is needed. We mention here two such areas.

Data Integration. A central research topic for many years (see Halevy et al. [46]), data integration has reached the point where it needs to cope explicitly with uncertainties [82]. Precise data mappings and schema mappings are increasingly hard to build on a large scale. As an example, in a recent paper in the journal Nature Reviews Genetics, Philippi and Kohler [66] discuss the difficulties in integrating life-science databases, and the biggest obstacle they identify is imprecision in the data. Data mappings are difficult because there is no uniform conceptualization of the data, and the databases themselves are incomplete (e.g. missing functional annotations) or uncertain (e.g. due to automatic, non-curated annotation). The common practice today is for a life-science researcher to manage uncertainties manually: they query one data source, say with a keyword, rank the answers manually based on their quality or relevance to the query, then feed the answers as queries into the next data source, ranking those results, etc. To do this automatically, a system needs to represent and manipulate the uncertainties explicitly, and use them to rank the data items.
Representing uncertainties explicitly is a cheap, perhaps temporary substitute for more precise integration, say one based on ontologies: a form of pay-as-you-go data integration [35, 45, 61].

Novel Data. We face increasingly large volumes of data generated by non-traditional means, which is almost always imprecise. Google Base [84], Flickr [83], the ESP game [79], and the Mechanical Turk [85] are examples of data generated by the masses, who are often free to deviate from the representation guidelines suggested by the system. Large-scale information extraction systems [33, 31, 53] extract data automatically from text, and this data is always imprecise. What may soon overtake any other kind of data in volume is data from the physical world, which originates from cheap sensors, cameras, and RFID tags. There is emerging work to make this data available on a World-Wide Sensor Web [11, 32]. Sensor data is imprecise by its very nature and has already been modeled probabilistically [29, 30]. RFID deployments have high error rates, e.g. [54] reports read rates in the 60-70% range, and the high-level data compounded from such readings is also best modeled probabilistically [60, 14, 56].

The techniques that will allow us to manage data with uncertainties need to be grounded in a formalism that

combines logic with some measure of the uncertainty. The AI community has already gone through a transition from logic to logic with uncertainty, and it is natural to expect data management to follow a similar path; after all, we are both modeling data and knowledge.

[Figure 1: Inference (deterministic and probabilistic) in AI and in Databases. AI: theorem prover; hardness [16], algorithms [17]; hardness [18], algorithms [63]. Databases: query evaluation; PTIME for safe queries (Th. 3.9), PTIME for conjunctive queries (Th. 3.4).]

A Paradigm for Managing Uncertainties

We describe here a general paradigm for data with uncertainties. We make no assumptions about the source or the nature of the uncertainties: it could be that the data in the real world is uncertain and the database simply records this uncertainty, or it could be that the real data is deterministic but our knowledge about it is uncertain for various reasons. The model simply assumes that the data is uncertain.

The AI community has considered several approaches to representing uncertainties, e.g. rule-based systems, Dempster-Shafer, fuzzy sets, but by 1987 the probability model had become dominant [48]. We only consider here modeling uncertainties as probabilities. Thus, in our model the data is probabilistic, where the probabilities are internal measures of the imprecision in the data. Users formulate queries using a standard query language, as if the data were precise. The system computes the answers, and for each answer computes an output probability score representing its confidence in that answer. These scores are then indicated somehow to the user, for example by returning them explicitly, by ranking the answers in decreasing order of their scores, or by indicating the scores using some color code.

The central problem discussed in this paper is the following: given a Boolean query q on a probabilistic database PDB, compute the probability of q on PDB, in notation P(q). We consider only Boolean queries, because the probability of an answer a to a non-Boolean query q(x) can be obtained by evaluating the Boolean query q[a/x] (substitute a for x). Our interest is in studying the data complexity [77]: fix q and study the complexity of computing P(q) as a function of the size of the probabilistic database PDB. We start by establishing a simple connection to the problem of computing the probability of a Boolean expression Φ, P(Φ). This has been studied in the literature: it is #P-complete [75], and DNF formulas admit a FPTRAS (fully poly-time randomized approximation scheme) [55]. It follows that for every fixed query q computing P(q) is in #P, and for every fixed conjunctive query q, computing P(q) admits a FPTRAS. We then study the data complexity for specific queries q, and show that the set of conjunctive queries without self-joins has a dichotomy: q is either in PTIME (and then we call it safe) or is #P-hard (and we call it unsafe), and, moreover, deciding between the two cases can be done in PTIME in the size of q. Thus, a query processor can examine q and either evaluate it exactly, when q is safe, or evaluate it approximately, when it is unsafe.

While probabilistic databases have a long history, there has been no systematic study of the query evaluation problem. Early systems have avoided addressing the full query evaluation problem by either restricting the queries [9], using an exponential-time evaluation algorithm [36], or relying on a weaker semantics based on intervals [59].

Probabilistic Inference vs. Query Evaluation
A problem related to query evaluation studied by the AI community is inference in probabilistic networks^1. Probabilistic inference remains very expensive to this date [58]. There is an important distinction between probabilistic inference in AI and query evaluation over probabilistic data, which mirrors the distinction between inference in knowledge bases and query evaluation in databases. In AI there is no separation between the query and the data, and the complexity of the inference problem is formulated in terms of the size of the probabilistic network Φ: both exact inference [16] and approximate inference [18] are hard. In databases we have two inputs, the query q and the probabilistic database PDB, and the system combines them at runtime to generate a probabilistic network Φq^PDB. The complexity is measured by fixing q and varying PDB: the latter is large, but usually has a simple probabilistic structure. As we saw, there are important tractable cases: for every conjunctive query approximate inference is tractable [40], and for every safe conjunctive query exact inference is tractable (Sec. 3). Thus, query evaluation on probabilistic databases is distinct from inference in probabilistic networks, in the same way in which query processing in relational databases is distinct from logical inference in knowledge bases; see Fig. 1.

Two Applications

We briefly illustrate how uncertainty in the data can be modeled with probabilities in two applications. We refer the reader to [38, 74] for more examples.

In fuzzy object matching the problem is to reconcile objects from two collections that have used different naming conventions. This is a central problem in data cleaning and data integration, and has also been called record linkage, deduplication, or merge-purge [34, 3, 13, 37, 43, 49, 15, 80, 7]. The basic approach is to compute a similarity score between pairs of objects, usually by first computing string similarity scores on their attributes, then combining these scores into a global score for the pair of objects. Next, this score is compared with a threshold, and each pair of objects is classified into a match, a non-match, or a maybe-match [3]. Ré et al. [68] propose to convert these scores into probabilities^2 and store them directly in a probabilistic database. There is no more need for a threshold: instead, all potential matches are stored together with their probabilities, then used during query processing to rank the answers. Fig. 2 (a) illustrates a small fragment of such a match table between movie titles from IMDB and reviews for DVD products from Amazon.

An information extraction system processes a collection of text documents and populates a user-defined schema [71]. All approaches to extraction are imprecise, and usually associate a probability to each extraction.

^1 The term probabilistic network is used in AI to denote a graph whose nodes represent random variables and whose edges represent correlations [17]. Examples are: Boolean expressions, Bayesian networks, and Markov networks.
^2 They are often already expressed as a probability.

Reproduced from [68]:

  Review       Movie                  P
  12 Monkeys   Twelve Monkeys         0.4
  12 Monkeys   Twelve Monkeys (1995)  0.3
  12 Monkeys   Monk                   0.013
  Monkey Love  Twelve Monkeys         0.35
  Monkey Love  Love Story             0.27
  (a)

Reproduced from [44]: "... 52-A Goregaon West Mumbai ..."

  ID  House-No  Area           City         P
  1   52        Goregaon West  Mumbai       0.1
  1   52-A      Goregaon West  Mumbai       0.5
  1   52        Goregaon       West Mumbai  0.2
  1   52-A      Goregaon       West Mumbai  0.3
  2   7         Westlake       ...          ...
  (b)

Figure 2: Two applications of probabilistic data. Fuzzy object matching: each review matches several candidate movies (a). Information extraction: each address has several possible segmentations (b).

Gupta and Sarawagi [44] describe how to use a probabilistic database to store the result of text segmentation with Conditional Random Fields: Fig. 2 (b) illustrates the possible segmentations of an address, together with their probabilities. Importantly, it has been shown in [44] that the probabilities computed by CRFs correlate with the probability that the extracted instance is correct, i.e. if one manually inspects all extracted items that have probability between, say, 0.3 and 0.35, then one finds that approximately 30-35% of them are correct in a given instance. In other words, the probabilities have meaning.

What these applications have in common is a need to store and manage data with a probabilistic semantics. Probabilities are already present in these applications, thus we are not concerned with where the probabilities are coming from. Instead, our goal is to allow users to ask queries over the data, and to compute the probability for each answer to the query. We give the definition of the data model next, then describe the query evaluation problem.

2. THE POSSIBLE WORLDS DATA MODEL

We review here the definition of a probabilistic database based on possible worlds, and of a disjoint-independent database. Throughout this paper we restrict our discussion to relational data over a finite domain: extensions to continuous domains [29] and to XML [50, 1, 76] have also been considered.

We fix a relational schema R = (R1, ..., Rk), where Ri is a relation name, has a set of attributes Attr(Ri), and a key Key(Ri) ⊆ Attr(Ri); we explain below why we include keys in our data model. Denote D a finite domain of atomic values. For i = 1, ..., k denote Tup_i the set of typed atomic tuples of the form Ri(a1, ..., ak) where a1, ..., ak ∈ D and k = |Attr(Ri)| is the arity of Ri, and denote Tup = ∪_i Tup_i the domain of all tuples. If t ∈ Tup_i then Key(t) represents a typed tuple of arity |Key(Ri)| consisting of the key attributes in t. A database instance is any subset I ⊆ Tup that satisfies all key constraints.

To illustrate, R(A, B, C), S(A, D), T(A, B, C) is a schema, the underlined attributes are the keys, and t1 = R(a, b, c), t2 = R(a, b, c'), t3 = T(a, b, c) are three distinct atomic tuples, which we sometimes write as R(a, b, c) to emphasize the key attributes. We have Key(t1) = Key(t2) ≠ Key(t3): the latter because we consider Key(t) to be a typed tuple.

The main idea in a probabilistic database is that the state of the database, i.e. the instance I, is not known. Instead the database can be in any one of a finite number of possible states I1, I2, ..., called possible worlds, each with some probability.

Definition 2.1. A probabilistic database is a probability space PDB = (W, P) where the set of outcomes is a set of possible worlds W = {I1, ..., In}. In other words, it is a function P : W → (0, 1] s.t. Σ_{I∈W} P(I) = 1.

We review basic notions in probability theory in the Appendix. Fig. 3 illustrates three possible worlds of a probabilistic database: the probabilities of all worlds must sum up to 1. The intuition is that we have a database with schema R(A, B, C, D), but we are not sure about the content of the database: there are several possible contents, each with a probability.

[Figure 3: A probabilistic database PDB = ({I1, I2, I3, ...}, P) with schema R(A, B, C, D); we show only three possible worlds, with P(I1) = 0.06 (= p1 p3 p6), P(I2) = 0.12 (= p2 p5 p6), and P(I3) = 0.04 (= p1 (1 - p3 - p4 - p5) p6).]

A Boolean query q defines the event {I | I ⊨ q} over a probabilistic database, and its marginal probability is P(q) = Σ_{I∈W, I⊨q} P(I). A tuple t is a special case of a Boolean query, and its marginal probability is P(t) = Σ_{I∈W, t∈I} P(I). Note that t ≠ t' and Key(t) = Key(t') implies P(t, t') = 0, i.e. t, t' are disjoint events.

Disjoint-Independent Databases. In practice we cannot enumerate all possible worlds, since there are usually too many. The representation problem for probabilistic databases is an active research topic [26, 10, 41, 6, 5], and Green and Tannen [41] observed a strong connection between representation systems for probabilistic databases and for incomplete databases. We restrict our discussion to disjoint-independent databases, which have a simple representation.

A possible tuple is a tuple that occurs in at least one possible world, and we denote T the set of possible tuples. PDB is disjoint-independent if any set of possible tuples with distinct keys is independent: ∀t1, ..., tn ∈ T, Key(ti) ≠ Key(tj) for i ≠ j implies P(t1, ..., tn) = P(t1) ··· P(tn). A disjoint-independent database is represented by enumerating the set of possible tuples T and their marginal probabilities P(t): this uniquely determines the probability P(I) of each instance I ⊆ T.^3 We denote a disjoint-independent database as PDB = (T, P), rather than (W, P), emphasizing that it is given by enumerating the set of possible tuples rather than the possible worlds. Its size is defined as the size of T. Fig. 4 shows the representation of a disjoint-independent database: it has 16 possible worlds, and three of them are precisely those shown in Fig. 3. There are seven possible tuples, each with some probability, and it is convenient to group the possible tuples by their keys, A, B, to emphasize that at most one can be chosen in each group. The two tables in Fig. 2 are also disjoint-independent. From now on, throughout the paper a database means a disjoint-independent probabilistic database.

[Figure 4: Representation of a disjoint-independent probabilistic database, with tuple probabilities p1 = 0.25, p2 = 0.75, p3 = 0.3, p4 = 0.3, p5 = 0.2, p6 = 0.8, p7 = 0.2. The seven possible tuples are grouped by their keys, for readability. There are 16 possible worlds; three are shown in Fig. 3.]

^3 P(I) is a product with one factor for each key k in T: if some tuple t ∈ I has key k, then that factor is P(t); otherwise it is 1 - Σ_{t∈T, Key(t)=k} P(t).
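To make Definition 2.1 and the (T, P) representation concrete, the following is a minimal Python sketch (ours, not the paper's) that stores a disjoint-independent database as a map from keys to alternative tuples, expands it into its possible worlds following footnote 3, and computes P(q) for a Boolean query by brute-force enumeration. The schema and probabilities are hypothetical.

```python
from itertools import product

# A disjoint-independent database (T, P): possible tuples grouped by key;
# tuples sharing a key are disjoint alternatives (at most one per world).
T = {
    ("R", "a1", "b1"): [(("R", "a1", "b1", "c1"), 0.25),
                        (("R", "a1", "b1", "c2"), 0.75)],
    ("S", "a1"):       [(("S", "a1", "d1"), 0.3)],
}

def possible_worlds(T):
    """Yield (world, probability) pairs. Per footnote 3: for each key,
    choose one alternative (factor P(t)) or none (the leftover factor)."""
    options = []
    for alts in T.values():
        opts = list(alts)
        leftover = 1.0 - sum(p for _, p in alts)
        if leftover > 1e-12:                 # the key may also be absent
            opts.append((None, leftover))
        options.append(opts)
    for combo in product(*options):
        world = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

def prob(q, T):
    """P(q): total probability of the worlds where the Boolean query holds."""
    return sum(p for w, p in possible_worlds(T) if q(w))

# Example Boolean query: some R-tuple with C = c1 occurs together with an S-tuple.
q = lambda w: (any(t[0] == "R" and t[3] == "c1" for t in w)
               and any(t[0] == "S" for t in w))
print(prob(q, T))   # 0.075 = 0.25 * 0.3
```

This enumeration is exponential in the number of keys, which is exactly why Sec. 3 studies evaluating P(q) directly on the (T, P) representation.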

We call the database independent if Key(Ri) = Attr(Ri) for all relation symbols Ri, i.e. there are no disjoint tuples.

Uncertainty at the Attribute Level. Many probabilistic database systems [9, 59, 29, 30, 70, 5] need to capture uncertainties at the attribute rather than the tuple level. For example, in a table S(A, B) the A-value of some tuple is known to be a1, but the B-value can be b1, b2, b3, with probabilities p1, p2, p3: we must have p1 + p2 + p3 = 1, since it is certain that a tuple of the form (a1, -) is in the database; only the value of B is uncertain. There exists a simple mapping from attribute-level uncertainties to a disjoint-independent database, as illustrated below:

  A                                    A   B   P
  a1  <b1, p1>, <b2, p2>, <b3, p3>     a1  b1  p1
  a2  <b2, p4>, <b9, p5>, ...          a1  b2  p2
                                       a1  b3  p3
                                       a2  b2  ...
                                       ...

Note that we had to make A a key to ensure that the three values for B are disjoint. The reason why we have keys in our data model is to represent attribute-level uncertainties. All the hardness results in Sec. 3 extend easily to attribute-level uncertainties.

A p-or-set [41] is a database where for every possible key k, Σ_{t∈T, Key(t)=k} P(t) = 1. S above is a p-or-set.

3. FOUNDATIONS OF QUERY EVALUATION

We study the query evaluation problem: given a query q and a disjoint-independent database PDB, compute P(q). We first describe the connection between this problem and the probability of Boolean formulas, which has been studied in the literature; in particular it follows that for any fixed q, computing P(q) is in #P in the size of PDB. We then analyze the precise complexity of P(q) for a fixed query, and establish a dichotomy for conjunctive queries without self-joins: every query is either #P-hard or in PTIME.

From Queries to Boolean Formulas

Consider n Boolean variables X̄ = {X1, ..., Xn}. A disjoint-independent probability space with outcomes 2^X̄ (the set of truth assignments) is given by a partition X̄ = X̄1 ∪ ... ∪ X̄m and a probability function P : X̄ → (0, 1] s.t. for all j = 1, m: Σ_{X∈X̄j} P(X) ≤ 1. A truth assignment is chosen at random by independently choosing in each set X̄j at most one variable Xi that will be set to true, while all others are set to false: namely, Xi is chosen with probability P(Xi). We will assume that all probabilities are given as rational numbers, P(Xi) = pi/qi, i = 1, n, where pi, qi are integers.

We call this probability space independent if ∀j, |X̄j| = 1
However, as we show in the Appendix, there existsa polynomial-time computable function F over the integersp1 , q1 , . . . , pn , qn , s.t. computing F · P(Φ) is in #P. For example, in the case of a uniform distribution, take F 2n ,and computing 2n P(Φ) #Φ is in #P. In this sense:Theorem 3.1. Computing P(Φ) is in #P.While computing P(Φ) for a DNF Φ is still #P-hard, Lubyand Karp [55] have shown that it has a FPTRAS (fully polytime randomized approximation scheme). More precisely:there exists a randomized algorithm A with inputs Φ, ε, δ,which runs in polynomial time in Φ , 1/ε, and 1/δ, and returns a value p̃ s.t. PA ( p̃/p 1 ε) δ. Here PA denotesthe probability over the random choices of the algorithm.Grädel et al. [40] show how to extend this to independentprobabilities, and we show in the Appendix how to extendit to disjoint-independent probabilities:Theorem 3.2. Computing P(Φ) for a disjoint-independentprobability space P and a DNF formula Φ has a FPTRAS.We now establish the connection between the probabilityof Boolean expressions and the probability of a query ona disjoint-independent database. Let P DB (T, P) be adatabase with possible tuples T {t1 , . . . , tn }. Associate aBoolean variable Xi to each tuple ti and define a disjointindependent probability space on their truth assignments bypartitioning the variables Xi according to their keys (Xi , Xjare in the same partition iff Key(ti ) Key(tj )), and bydefining P(Xi ) P(ti ). This creates a one-to-one correspondence between the possible worlds of P DB and thetruth assignments 2X̄ , which preserves the probabilities.Consider a Boolean query, q, expressed in First OrderLogic (FO) over the vocabulary R. The intensional formulaassociated to q and database P DB is a Boolean formulaDBΦP, or simply Φq when P DB is understood from theqcontext, defined inductively as follows:

We now establish the connection between the probability of Boolean expressions and the probability of a query on a disjoint-independent database. Let PDB = (T, P) be a database with possible tuples T = {t1, ..., tn}. Associate a Boolean variable Xi to each tuple ti and define a disjoint-independent probability space on their truth assignments by partitioning the variables Xi according to their keys (Xi, Xj are in the same partition iff Key(ti) = Key(tj)), and by defining P(Xi) = P(ti). This creates a one-to-one correspondence between the possible worlds of PDB and the truth assignments 2^X̄, which preserves the probabilities.

Consider a Boolean query q, expressed in First Order Logic (FO) over the vocabulary R. The intensional formula associated to q and database PDB is a Boolean formula Φq^PDB, or simply Φq when PDB is understood from the context, defined inductively as follows:

  Φ_{R(ā)}     = Xi       if R(ā) = ti ∈ T
  Φ_{R(ā)}     = false    if R(ā) ∉ T
  Φ_true       = true
  Φ_false      = false
  Φ_{¬q}       = ¬(Φq)
  Φ_{q1 ∧ q2}  = Φ_{q1} ∧ Φ_{q2}
  Φ_{q1 ∨ q2}  = Φ_{q1} ∨ Φ_{q2}
  Φ_{∃x.q(x)}  = ∨_{a∈D} Φ_{q[a/x]}
  Φ_{∀x.q(x)}  = ∧_{a∈D} Φ_{q[a/x]}

Here D represents the active domain of PDB, q[a/x] denotes the formula q where the variable x is substituted with a, and interpreted predicates over constants (e.g. a = b or a ≤ b) are replaced by true or false respectively. If q has v variables, then the size of Φq is O(|q| · |D|^v). The connection between Boolean formulas and Boolean queries is:

Proposition 3.3. For every query q and any database PDB = (T, P), P(q) = P(Φq).

A Boolean conjunctive query is restricted to the connectives ∧ and ∃, and, as usual, we write it as:

  q = g1, g2, ..., gk    (1)

where each gi is a positive relational predicate called a subgoal. Obviously, in this case Φq is a positive DNF expression. For a simple illustration, suppose q = R(x), S(x, y) and that we have five possible tuples: t1 = R(a), t2 = R(b), t3 = S(a, c), t4 = S(a, d), t5 = S(b, d), to which we associate the Boolean variables X1, X2, Y1, Y2, Y3; then Φq = X1 Y1 ∨ X1 Y2 ∨ X2 Y3 (see the code sketch below). Our discussion implies:

Theorem 3.4. For a fixed Boolean query q, computing P(q) on a disjoint-independent database (1) is in #P, and (2) admits a FPTRAS when q is conjunctive [40].

Grädel et al. [40] stated that computing P(q) is in FP^#P (the class of functions computable by a PTIME machine with a #P oracle). They also showed that conjunctive queries have a FPTRAS over independent databases.
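As a small illustration of Proposition 3.3, this sketch grounds a Boolean conjunctive query into its positive DNF Φq, reproducing the example above; the encoding (relations as dicts from tuples to variable names, variables written "?x") is ours, not the paper's.

```python
from itertools import product

def is_var(a):
    return isinstance(a, str) and a.startswith("?")

def ground(subgoals, tables):
    """Build the intensional DNF of a Boolean conjunctive query.
    subgoals: list of (relation, args); args mix constants and '?variables'.
    tables:   dict relation -> {tuple_of_constants: Boolean variable name}.
    Returns a list of clauses (sets of variable names), one clause per
    valuation of the query variables matched by a possible tuple everywhere."""
    qvars = sorted({a for _, args in subgoals for a in args if is_var(a)})
    domain = sorted({c for rel in tables for t in tables[rel] for c in t})
    clauses = set()
    for vals in product(domain, repeat=len(qvars)):     # |D|^v valuations
        env = dict(zip(qvars, vals))
        clause = set()
        for rel, args in subgoals:
            t = tuple(env.get(a, a) for a in args)
            if t not in tables[rel]:
                break
            clause.add(tables[rel][t])
        else:                    # every subgoal matched a possible tuple
            clauses.add(frozenset(clause))
    return [set(c) for c in clauses]

# The example above: q = R(x), S(x, y) with five possible tuples.
tables = {
    "R": {("a",): "X1", ("b",): "X2"},
    "S": {("a", "c"): "Y1", ("a", "d"): "Y2", ("b", "d"): "Y3"},
}
q = [("R", ("?x",)), ("S", ("?x", "?y"))]
print(ground(q, tables))   # the clauses {X1,Y1}, {X1,Y2}, {X2,Y3}, in some order
```

For a tuple-independent database, the resulting clause list can be fed directly to the Karp-Luby sketch above, which is the FPTRAS route of Theorem 3.4; enumerating |D|^v valuations also matches the O(|q| · |D|^v) bound on the size of Φq.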
A Dichotomy for Queries without Self-joins

We now establish the following dichotomy for queries without self-joins: computing P(q) is either #P-hard or in PTIME in the size of the database PDB = (T, P). A query q is said to be without self-joins if each relational symbol occurs at most once in the query body [21, 20]. For example, R(x, y), R(y, z) has self-joins, while R(x, y), S(y, z) does not.

Three Hard Queries. We prove in the Appendix:

Theorem 3.5. For each of the queries below (where k, m ≥ 1), computing P(q) is #P-hard in the size of the database:

  h1  = R(x), S(x, y), T(y)
  h2* = R1(x, y), ..., Rk(x, y), S(y)
  h3* = R1(x, y), ..., Rk(x, y), S1(x, y), ..., Sm(x, y)

The underlined positions represent the key attributes (see Sec. 2); thus, in h1 the database is tuple-independent, while in h2*, h3* it is disjoint-independent. When k = m = 1 we omit the superscript and write:

  h2 = R(x, y), S(y)
  h3 = R(x, y), S(x, y)

The significance of these three (classes of) queries is that the hardness of any other conjunctive query without self-joins follows from a simple reduction from one of these three (Lemma 3.7). By contrast, the hardness of these three queries is shown directly (by reducing Positive Partitioned 2DNF [67] to h1, and PERMANENT [75] to h2*, h3*), and these proofs are more involved.

Previously, the complexity has been studied only for independent probabilistic databases. De Rougemont [27] claimed that it is in PTIME. Grädel et al. [27, 40] corrected this and proved that the query R(x), R(y), S1(x, z), S2(y, z) is #P-hard, by reduction from regular (non-partitioned) 2DNF; note that this query has a self-join (R occurs twice). h1 does not have a self-join, and was first shown to be #P-hard in [21]. The only prior discussion of the complexity on disjoint-independent databases is in [20]: we announced (without proof) the hardness of h2 and of^4 R(x, y), S(z, y), but missed h3 as well as the *-variants.

^4 This query rewrites to h2, hence is hard by Lemma 3.7.

A PTIME Algorithm. We describe here an algorithm that evaluates P(q) in polynomial time in the size of the database, which works for some queries and fails for others. We need some notations. Vars(q) and Sg(q) are the set of variables and the set of subgoals, respectively. If g ∈ Sg(q) then Vars(g) and KVars(g) denote all variables in g, and all variables in the key positions in g: e.g. for g = R(x, a, y, x, z), Vars(g) = {x, y, z}, KVars(g) = {x, y}. For x ∈ Vars(q), let sg(x) = {g | g ∈ Sg(q), x ∈ KVars(g)}. Given a database PDB = (T, P), D is its active domain.

Algorithm 3.1 computes P(q) by recursion on the structure of q. If q consists of connected components q1, q2, then it returns P(q1) P(q2): this is correct since q has no self-joins, e.g. P(R(x), S(y, z), T(y)) = P(R(x)) P(S(y, z), T(y)). If some variable x occurs in a key position in all subgoals, then it applies the independent-project rule: e.g. P(R(x)) = 1 - Π_{a∈D}(1 - P(R(a))) is the probability that R is nonempty. For another example, we apply an independent project on x in q = R(x, y), S(x, y): this is correct because q[a/x] and q[b/x] are independent events whenever a ≠ b. If there exists a subgoal g whose key positions are constants, then it applies a disjoint project on any variable in g: e.g. x is such a variable in q = R(x, y), S(c, d, x), and any two events q[a/x], q[b/x] are disjoint because of the S subgoal.

We illustrate the algorithm on the query below:

  q = R(x), S(x, y), T(y), U(u, y), V(a, u)

  P(q) = Σ_{b∈D} P(R(x), S(x, y), T(y), U(b, y), V(a, b))
       = Σ_{b∈D} P(R(x), S(x, y), T(y), U(b, y)) · P(V(a, b))
       = Σ_{b∈D} Σ_{c∈D} P(R(x), S(x, c), T(c), U(b, c)) · P(V(a, b))
       = Σ_{b∈D} Σ_{c∈D} P(R(x), S(x, c)) · P(T(c)) · P(U(b, c)) · P(V(a, b))
       = Σ_{b∈D} Σ_{c∈D} (1 - Π_{d∈D}(1 - P(R(d)) P(S(d, c)))) · P(T(c)) · P(U(b, c)) · P(V(a, b))

We call a query safe if algorithm Safe-Eval terminates successfully; otherwise we call it unsafe.

Algorithm 3.1 Safe-Eval
Input: query q and database PDB = (T, P)
Output: P(q)
1: Base Case: if q = R(ā), return P(R(ā)) if R(ā) ∈ T, else 0
2: Join: if q = q1, q2 and Vars(q1) ∩ Vars(q2) = ∅, return P(q1) P(q2)
3: Independent project: if sg(x) = Sg(q), return 1 - Π_{a∈D}(1 - P(q[a/x]))
4: Disjoint project: if ∃g (x ∈ Vars(g), KVars(g) = ∅), return Σ_{a∈D} P(q[a/x])
5: Otherwise: FAIL

Safety is a property that depends only on the query q, not on the database PDB, and it can be checked in PTIME in the size of q by simply running the algorithm over an active domain of size 1, D = {a}. Based on our previous discussion, if the query is safe then the algorithm computes the probability correctly:

Proposition 3.6. For any safe query q, the algorithm computes P(q) correctly and runs in time O(|q| · |D|^|Vars(q)|).

We first described Safe-Eval in [20], in a format more suitable for an implementation, by translating q into an algebra plan using joins, independent projects, and disjoint projects, and stated without proof the dichotomy property. Andritsos et al. [4] describe a query evaluation algorithm for a more restricted class of queries.

The Dichotomy Property. We define below a rewrite rule q → q' between two queries, where g, g' denote subgoals:

  q    → q[a/x]   if x ∈ Vars(q), a ∈ D
  q    → q1       if q = q1, q2, Vars(q1) ∩ Vars(q2) = ∅
  q    → q[y/x]   ...
  q, g → q, g'    ...

Call a query q final if it is unsafe and, for every q' s.t. q → q', q' is safe. Clearly every unsafe query rewrites to a final query: simply apply → repeatedly until all rewritings are to safe queries. We prove in the Appendix:

Lemma 3.8. h1, h2*, h3* are the only final queries.

This implies immediately the dichotomy property:

Theorem 3.9. Let q be a query without self-joins. Then one of the following holds:
- q is unsafe and q rewrites to one of h1, h2*, h3*. In particular, q is #P-hard.
- q is safe. In particular, it is in PTIME.
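To make the rules of Algorithm 3.1 concrete, here is a minimal Python sketch of Safe-Eval for Boolean conjunctive queries without self-joins. The encoding (subgoals as (relation, args, key-length) triples, "?"-prefixed variables, db mapping each relation to its possible tuples with marginal probabilities) is ours; the paper's own formulation is as an algebra plan [20]. Unsafe queries, such as h1, make it fail.

```python
def is_var(a):
    return isinstance(a, str) and a.startswith("?")

def vars_of(g):                      # g = (relation, args, key_len)
    return {a for a in g[1] if is_var(a)}

def key_vars(g):                     # variables among the key positions
    return {a for a in g[1][:g[2]] if is_var(a)}

def substitute(q, x, a):             # q[a/x]
    return [(r, tuple(a if t == x else t for t in args), kl)
            for r, args, kl in q]

def components(q):                   # connected components via shared variables
    comps = []
    for g in q:
        goals, cvars = [g], vars_of(g)
        for c in [c for c in comps if cvars & c[1]]:
            goals += c[0]
            cvars |= c[1]
            comps.remove(c)
        comps.append((goals, cvars))
    return [goals for goals, _ in comps]

def safe_eval(q, db, D):
    """P(q) for a Boolean conjunctive query without self-joins.
    db: relation -> {tuple: marginal probability}; D: active domain."""
    # 1. base case: a single ground atom
    if len(q) == 1 and not vars_of(q[0]):
        rel, args, _ = q[0]
        return db[rel].get(args, 0.0)
    # 2. join: independent connected components multiply
    comps = components(q)
    if len(comps) > 1:
        result = 1.0
        for c in comps:
            result *= safe_eval(c, db, D)
        return result
    # 3. independent project: some x in a key position of every subgoal
    for x in sorted(set().union(*(vars_of(g) for g in q))):
        if all(x in key_vars(g) for g in q):
            fail = 1.0
            for a in D:
                fail *= 1.0 - safe_eval(substitute(q, x, a), db, D)
            return 1.0 - fail
    # 4. disjoint project: some subgoal's key positions are all constants
    for g in q:
        if not key_vars(g) and vars_of(g):
            x = min(vars_of(g))
            return sum(safe_eval(substitute(q, x, a), db, D) for a in D)
    raise ValueError("unsafe query")  # e.g. h1 = R(x), S(x, y), T(y)

# q = R(x), S(x, y), with keys R(A) and S(A) (key_len = 1); hypothetical data:
D = ["a", "b"]
db = {"R": {("a",): 0.5, ("b",): 0.5},
      "S": {("a", "a"): 0.2, ("a", "b"): 0.3, ("b", "a"): 0.4}}
q = [("R", ("?x",), 1), ("S", ("?x", "?y"), 1)]
print(safe_eval(q, db, D))   # 0.4 (up to float rounding)
```

The example runs independent project on x, splits each q[a/x] into the ground atom R(a) and the residual subgoal S(a, y), and finishes S(a, y) with a disjoint project on y, exactly as in the illustration above.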
