Improving Data Quality by Leveraging Statistical Relational Learning

Visengeriyeva, L and Akbik, A and Kaul, Manohar and Rabl, T and Markl, V (2016) Improving Data Quality by Leveraging Statistical Relational Learning. In: International Conference on Information Quality, 22-23 June, 2016, Ciudad Real, Spain.

larysa.pdf - Accepted Version

Abstract

Digitally collected data suffers from many data quality issues, such as duplicate, incorrect, or incomplete data. A common approach for counteracting these issues is to formulate a set of data cleaning rules to identify and repair incorrect, duplicate, and missing data. Data cleaning systems must be able to treat data quality rules holistically, to incorporate heterogeneous constraints within a single routine, and to automate data curation. We propose an approach to data cleaning based on statistical relational learning (SRL). We argue that one formalism, Markov logic, is a natural fit for modeling data quality rules. Our approach enables probabilistic joint inference over interleaved data cleaning rules to improve data quality. Furthermore, it eliminates the need to specify the order of rule execution. We describe how data quality rules expressed as formulas in first-order logic directly translate into the predictive model in our SRL framework.
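To make the idea of joint inference over weighted rules concrete, the following is a minimal sketch of a Markov logic style model, not the system described in the paper. It assumes a hypothetical pair of records with two observed facts (`sim_name`, `zip_eq`) and two unknown facts (`dup`, `city_eq`); the three rules and their weights are invented for illustration. Each possible world is scored by the exponential of the summed weights of the rules it satisfies, and marginals are computed by brute-force enumeration, which is only feasible for toy examples.

```python
import itertools
import math


def implies(a, b):
    """Truth value of the logical implication a => b."""
    return (not a) or b


# Observed facts about a hypothetical record pair (assumed for illustration).
EVIDENCE = {"sim_name": True, "zip_eq": True}

# Facts the model should infer jointly.
HIDDEN = ["dup", "city_eq"]

# (weight, formula) pairs; rules and weights are illustrative assumptions.
RULES = [
    (1.5, lambda w: implies(w["sim_name"], w["dup"])),    # similar names suggest a duplicate
    (2.0, lambda w: implies(w["zip_eq"], w["city_eq"])),  # same zip suggests same city
    (1.0, lambda w: implies(w["dup"], w["city_eq"])),     # duplicates share a city
]


def world_score(world):
    """Unnormalized score: exp of the summed weights of satisfied rules."""
    return math.exp(sum(weight for weight, formula in RULES if formula(world)))


def marginals():
    """Marginal probability of each hidden fact, by enumerating all worlds."""
    totals = {atom: 0.0 for atom in HIDDEN}
    z = 0.0  # normalization constant (partition function)
    for values in itertools.product([False, True], repeat=len(HIDDEN)):
        world = {**EVIDENCE, **dict(zip(HIDDEN, values))}
        score = world_score(world)
        z += score
        for atom in HIDDEN:
            if world[atom]:
                totals[atom] += score
    return {atom: totals[atom] / z for atom in HIDDEN}


if __name__ == "__main__":
    for atom, prob in marginals().items():
        print(f"P({atom}) = {prob:.3f}")
```

Note that the rules are listed in an arbitrary order: all of them contribute to every world's score at once, so no execution order has to be chosen. Rules act as soft constraints, so a world that violates one is made less probable rather than impossible.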

IITH Creators: