The URank project introduces new probabilistic formulations for ranking and ranking-aggregate queries in probabilistic databases. The formulations are based on marriage of traditional ranking semantics with possible worlds semantics. In the light of these formulations, we have designed novel techniques to support different probabilistic ranking queries, and leverage existing query processing and indexing capabilities in current RDBMSs to recognize and handle data uncertainty in score-based ranking.
1. Introduction
Ranking and aggregation queries are widely used in data exploration, data analysis and decision making scenarios. Ranking queries (also referred to as Top-k Queries) report the top-ranked tuples in the database based on scores computed by a given scoring function defined on one or more scoring predicates (e.g., functions defined on one or more database columns). A scoring function induces a unique total order on database tuples. Ranking-aggregate queries (also referred to as Top-k Aggregate Queries) rank groups of tuples based on their aggregate score values (e.g., average score) and return the top-ranked groups.
While most of the currently proposed ranking and aggregation techniques focus on deterministic data, several emerging applications (e.g., Web mashups, location-based services, and sensor data management) involve data that is unclean or uncertain.
For example, in the context of the Web, information extraction techniques are used to extract instances of entities, e.g., organization and person names. The imperfection of extraction tools and the inherent ambiguity of unstructured text introduce uncertainties in the extracted information. Ranking under such uncertain information based on some scoring measure is important to a wide range of applications. For example, a popularity survey may require ranking movies based on extracted ratings from online movie reviews. In the context of sensor networks, sensor readings are usually represented using probabilistic models, derived from readings history and sensor correlations. Many applications can benefit from aggregating sensor data. For example, grouping sensor readings by location to find the top-k locations based on average temperature.
We consider two different types of data uncertainty:
- Tuple level uncertainty: tuples may belong to the database with less than absolute confidence. A widely-used model to capture this type of uncertainty is representing tuples as probabilistic events, and model the database as a joint distribution defined on these events.
- Value level uncertainty: tuple attributes may have uncertain values. A widely-used model to capture this type of uncertainty is representing tuple attributes as probability distributions defined on discrete or continuous domains.
Under both types of uncertainty, ranking and aggregation queries have intriguing formulations due to the possibility of having multiple possible ranked instances (possible worlds) of the same database originating from tuple/value uncertainties. Ranking and aggregating uncertain (probabilistic) data raises new challenges in query semantics and processing, making conventional methods inapplicable. Furthermore, uncertainty imposes probability as a new ranking dimension that does not exist in the traditional settings.
The following example illustrates the challenges involved in formulating and computing ranking queries under tuple level uncertainty:
Example 1. In a traffic-monitoring system, radars detect cars' speeds automatically, while car identification, e.g., by plate number, is performed by a human operator, or OCR of plate number images. In this system, multiple sources contribute to data uncertainty, e.g., high voltage lines that interfere with radars affecting their precision, closeby cars that cannot be distinguished, or plate number images that are not clear enough to identify the car precisely. Figure 1 shows a snapshot of speed Readings relation in the last hour. The special attribute Prob in each tuple indicates the probability that the whole tuple carries correct information. This probability can be obtained from various sources. For example, history of previous readings might indicate that 70% of the readings obtained from radars close to high voltage lines are actually correct. Hence, the readings of a radar unit that is close to high voltage lines can be assumed to be correct with probability 0.7. Other sources, e.g., clearness of plate number images, can be additionally incorporated to better quantify tuple's uncertainty.
![]() |
| Figure 1: A relation with tuple uncertainty and example ranking and aggregation queries |
Figure 1 shows two queries to be evaluated on the Readings relation in Example 1. Query 1 requests the top-k speeding cars in one hour interval, which can be used, e.g., in an accident investigation scenario, while Query 2 requests the top-k locations based on average speed, which can be used, e.g., to recommend the most suitable locations of speed traps. While both queries are clear in deterministic databases, we show that their interpretation and processing is challenging for probabilistic databases.
In Query 1, although tuple score (the Speed attribute) is deterministic, the tuples are probabilistic. Both tuples' probabilities and scores need to be factored in our interpretation of this query. This effectively introduces two interacting ranking dimensions that interplay to decide meaningful query answers. For example, it is not meaningful to report a top-scored tuple with insignificant probability. Moreover, combining scores and probabilities into one measure, using some aggregation function, eliminates uncertainty completely, and loses valuable information that can be used to get more meaningful answers conforming with probabilistic query models.
In the case of aggregation, additional challenges arise. In Query 2, each location (group) involves a number of probabilistic tuples. For example, for location L1, there is a chance that a Honda and a Toyota tuples are correctly recorded, only one of them is correctly recorded, or none of them is correctly recorded. To compute Average(Speed), the aggregate value of each group, we need to take all possibilities into account. This means that we have, for each group, a probability distribution over its aggregate value. Computing such probability distribution is NP. This is a clear departure from the conventional semantics of aggregate queries, where the aggregate function gives a single deterministic value for each group.
Figure 2 shows a snapshot of actual search results reported by apartments.com for a simple search for available apartments to rent. The shown search results include several uncertain pieces of information. For example, some apartment listings do not explicitly specify the deposit amount. Other listings mention apartment rent and area as ranges rather than single values. The obscure data in Figure 2 may originate from different sources including the following:
- Data entry errors, for example, an apartment listing is missing the number of rooms by mistake.
- Integrating heterogeneous data sources, for example, listings are obtained from sources with different schemas.
- Privacy concerns, for example, zip codes are anonymized.
- Marketing policies, for example, areas of small-size apartments are expressed as ranges rather than precise values.
- Presentation style, for example, search results are aggregated to group similar apartments.
![]() |
| Figure 2: Uncertain data on the Web |
Uncertainty introduces new challenges regarding both the semantics and processing of ranking queries. We illustrate such challenges using the following example:
Example 2. Assume an apartment database. Figure 3 gives a snap shot of the results of some user query posed against such database. Assume that the user would like to rank the results using a function that scores apartments based on rent (the cheaper the apartment, the higher the score).Since the rent of apartment a2 is specified as a range, and the rent of apartment a4 is unknown, the scoring function assigns a range of possible scores to a2, while the full score range [0 -10] is assigned to a4.
![]() |
| Figure 3: Partial order of records with uncertain scores |
Figure 3 also depicts a diagram for the partial order induced by apartment scores. Disconnected nodes in the diagram indicate the incomparability of their corresponding records. Due to the intersection of score ranges, Apartment a4 is incomparable to all other records, and Apartment a2 is incomparable to Apartment a3.
One intuitive method to rank the objects involved in a partial order is inspecting the space of possible rankings that conform to the relative order of objects. These rankings are called the linear extensions of the partial order. For example, Figure 3 shows all linear extensions of the given partial order. Inspecting the space of linear extensions allows ranking the objects in a way consistent with the partial order. For example, a1 may be preferred to a4 since a1 appears at Rank 1 in 8 out of 10 linear extensions. A crucial challenge for such approach is that the space of linear extensions grows exponentially in the number of objects.
Furthermore, in many scenarios, uncertainty is quantified probabilistically. For example, a moving object's location can be described using a probability distribution defined on some region based on location history. Similarly, a missing attribute can be filled in with a probability distribution of multiple imputations, using machine learning methods. Augmenting uncertain scores with such probabilistic quantifications generates a (possibly non-uniform) probability distribution of linear extensions that cannot be captured using a standard partial order or dominance relationship.
Our adopted ranking models are based on possible worlds semantics, where a probabilistic database is viewed as a set of possible worlds that represents an enumeration of all possible instances of the database.
Under Tuple Level Uncertainty, each world is a subset of database tuples. The principles of probability theory are used to support this view by modeling tuples as probabilistic events. Specifically, a tuple t is associated with an event t.e, such that t exists in the database with probability Pr(t.e), and does not exist in the database with probability Pr(NOT t.e) =1 - Pr(t.e). Possible worlds are thus viewed as conjunctions of tuple events.
Possible worlds probabilities are determined based on the probabilistic correlations among tuples, e.g., mutual exclusion of tuples that map to the same real world entity. We call such correlations Generation Rules, since they control how the possible worlds space is generated. Such rules could naturally arise with unclean data, or could be enforced to satisfy application requirements or reflect domain semantics. Moreover, the relational processing of probabilistic tuples induces correlations among intermediate query results, even when base tuples are uncorrelated.
Figure 4 shows the Readings relation, from Example 1, augmented with generation rules that enforce the following requirement: based on radar locations, the same car cannot be detected at two different locations within 1 hour interval. Figure 4 shows the possible worlds, and their probabilities. Each world can be seen as a joint event of the existence of world's tuples, and the absence of all other database tuples. The probability of this joint event is determined by tuple probabilities, and the generation rules that correlate them.
![]() |
| Figure 4: Ranked possible worlds of Relation Readings based on Speed attribute |
Ranking each possible world based on tuple scores gives different possible rankings of the database. Computing ranking queries requires reasoning about these different rankings to produce an overall meaningful tuple ranking.
Under Value Level Uncertainty, the set of possible worlds is a set of database duplicates (i.e., no tuples are missing). These duplicates can have different rankings due to score uncertainty.
We adopt a general representation of uncertain scores, where the score of tuple t is modeled as a probability density function defined on a real interval [lo(t),up(t)]. A deterministic (certain) score is modeled as an interval with equal bounds, and a probability of 1.
The interval-based score representation induces a partial order over tuples, which extends the conventional definition of strict partial orders by creating a probabilistic dominance relation among tuples with intersecting score intervals, and a strict dominance relation among tuples with non-intersecting score intervals. We call such partial order a Probabilistic Partial Order (PPO for short).
Figure 5 shows the Hasse diagram of an example PPO of records with uniform score densities. The strict dominance relationships are indicated using arrows, while the probabilistic dominance relationships are given in the set P. We also show the set of linear extensions of the PPO. The linear extensions of a PPO can be viewed as tree where each root-to-leaf path is one linear extension. The root node is a dummy node since there can be multiple tuples that may be ranked first. Each occurrence of a tuple t in the tree represents a possible ranking of t, and each level i in the tree contains all tuples that occur at rank i in any linear extension.
![]() |
| Figure 5: Probabilistic partial order and linear extensions |
Due to probabilistic dominance relationships, the space of possible linear extensions is a probability space generated by a probabilistic process that draws, for each record t , a random score value in [lo(t),up(t)] based on the score density of t. Ranking the drawn score values gives a total order of database tuples, where the probability of such order is the joint probability of the drawn scores. For example, we show in Figure 5, the probability value associated with each linear extension, which is computed by evaluating the shown nested integral.
Similar to Tuple Level Uncertainty, computing ranking queries under Value Level Uncertainty requires reasoning about the possible linear extensions of the underlying PPO to produce an overall meaningful tuple ranking.




