ProbClean

Modeling and Querying Possible Repairs of DataQuality Problem

University of Waterloo

 Overview

The quality of real world data is impaired by several errors and anomalies such as duplicate records, violation of integrity constraints, and missing values. The data cleaning process involves several uncertain aspects (e.g., uncertainty in determining whether two records are duplicates). In ProbClean, we leverage the awareness of such uncertainty by adopting a probabilistic approach to data cleaning. Specifically, we view data cleaning as a probabilistic process whose possible outcomes are multiple possible repairs. ProbClean provides several techniques for generating, storing and reasoning about possible repairs of data quality problems.

 


 1. Introduction

The quality of real world data is impaired by several errors and anomalies such as duplicate records, violation of integrity constraints, and missing values. Previous data cleaning approaches concentrate on producing a single clean database instance (i.e., a repair). However, the data cleaning process involves several uncertain aspects (e.g., uncertainty in determining whether two records are duplicates). We leverage the awareness of such uncertainty by adopting a probabilistic approach to data cleaning. Specifically, we view data cleaning as a probabilistic process whose possible outcomes are multiple possible repairs. ProbClean currently focuses on two data quality problems:

  • detection of duplicate records, and
  • violations of functional dependencies (FDs).

Figure 1

 

The goal of ProbClean is to efficiently generate possible repairs of various data quality problems, and to enable compact representation of such repairs. Furthermore, ProbClean allows efficient query answering using a set of possible repairs. Figure 1 illustrates the high-level architecture of ProbClean.

 2. Duplicate Detection

The goal of the duplicate dedection process is to identify tuples that refer to the same real-world entity, and merge them into one representative tuple. A repair of a given database instance is a clustering of the tuples such that tuples belonging to the same cluster are deemed duplicates. For example, in Figure 2, the left-hand side repair declares tuples 1 and 2 duplicates.

Figure 2

 

Unfortunately, noise in real-world data prevents accurate detection of duplicates. Therefore, we see the duplicate detection process as an uncertain process where the outcome is one possible clustering. For example, in Figure 2, an uncertain duplicate detection has three possible clusterings (i.e., repairs) of the input instance.

In order to allow efficent genertation and storage of possible repairs, we introduced the following contribution:

  • We explored multiple spaces of possible repairs and we presented a technique to efficiently generate all repairs that are valid outcomes of any fixed parameterized clustering algorithm.
  • We introduce an uncertainty model for representing the possible repairs generated by any fixed parameterized clustering algorithm. The model allows specifying cleaning parameters as query predicates by pushing the expensive record clustering operations to an offline construction step.
  • We describe how to evaluate relational queries under our model. We also propose new query types with online cleaning requirements that are not supported by current one-shot cleaning methods.
  • We show how to integrate our approach into relational DBMSs to allow storage of possible repairs and performing probabilistic query processing efficiently.
  • We also conduct an experimental study to evaluate the scalability and efficiency of our techniques under different configurations.


 3. Violations of Functional Dependencies

Functional dependencies (FDs) can be thought of as integrity constraints that encode data semantics. In that sense, violations of FDs indicate deviations from the expected semantics, possibly caused by data quality problems.

Figure 3

For example, in Figure 3, the FD ZIP-->State,City enforces the State and the City attributes of tuples that agree on the ZIP code to be the same. Clearly, the shown instance violates such an FD because tuples t2 and t3 have the same ZIP code and different City.

Because there are possibly multiple ways to fix a single violation of FD, it is unreasonable to present only one possible repair to the user. In our approach, we generate a sample of repairs from a specific space of possible repairs, namely cardinality-set-minimal repairs. Such space includes all repairs that represent minimal sets of changed cells. We introduced an efficient sampling technique that is based on repairing single cell at a time. We also provided several performance enhancements to reduce the time required to generate a sample of repairs.

Additionally, our approach allows specifying hard constraints over the sampling procedure such that repairs can only change a given set of cells, while keeping the remaining cells unchanged.

We introduced an experimental study to show the scalability of our repair sampling technique, compared to previous data cleaning approaches.