Both data and the information derived from it have an increasing impact on companies, since many short- and long-term decisions are based on the interpretation of data. In addition to long-established relational databases, unstructured and semi-structured data sources are gaining importance; examples include web logs, data from social networks, and RFID data. The relevance of data quality monitoring for these kinds of data sources is therefore increasing enormously, in order to prevent wrong decisions based on faulty analyses of flawed data.
The main goal of this joint project is to study advanced concepts for the data quality monitoring of heterogeneous data sources. The project will result in a prototype tool for data quality analysis, consisting mainly of components for the semi-automated detection of data quality rules and for the automated monitoring of those rules.
The focus lies on the detection and derivation of data quality rules. One goal is to automatically detect complex statistical dependencies in the data and to suggest rules based on the results. Suggested rules are expressed in a data quality rule language, which will be designed within the scope of this project. The data quality manager reviews these automatically generated rules and may adjust them; this feedback is used to continuously improve the rule detection.
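The rule language and the detection algorithms are still to be designed, so the following is only a minimal sketch of what such a suggestion step might look like, assuming a simple value-range heuristic (mean plus or minus three standard deviations); the Rule class, its fields, and suggest_range_rule are hypothetical names, not part of the project:

```python
# Minimal sketch of semi-automated rule detection, assuming a simple
# value-range heuristic. All names here are hypothetical.
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Rule:
    """A suggested data quality rule, pending review by the data quality manager."""
    column: str
    predicate: str           # rule expression, here as a SQL-style condition
    confidence: float        # fraction of the sample that satisfies the rule
    approved: bool = False   # set by the data quality manager after review


def suggest_range_rule(column: str, values: list[float]) -> Rule:
    """Suggest a value-range rule from observed column data."""
    mu, sigma = mean(values), stdev(values)
    low, high = mu - 3 * sigma, mu + 3 * sigma
    inside = sum(1 for v in values if low <= v <= high)
    return Rule(
        column=column,
        predicate=f"{column} BETWEEN {low:.2f} AND {high:.2f}",
        confidence=inside / len(values),
    )


if __name__ == "__main__":
    rule = suggest_range_rule("temperature", [20.1, 21.4, 19.8, 22.0, 20.5])
    print(rule.predicate, rule.confidence)
```

The approved flag reflects the review step described above: a suggested rule only becomes active once the data quality manager has inspected and possibly adjusted it.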
A central data quality repository will be introduced. It contains the data quality rules as well as metadata about rule-checking executions. In addition to the semi-automatically detected rules, the data quality manager may insert manually created data quality rules.
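The project description does not specify the repository's schema; as a minimal sketch, assuming a relational store with one table for rules and one for execution metadata, it might look as follows (the table and column names are hypothetical):

```python
# Minimal sketch of the central data quality repository as a SQLite store.
# The schema and all names are assumptions, not the project's actual design.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS dq_rule (
    rule_id     INTEGER PRIMARY KEY,
    column_name TEXT NOT NULL,
    predicate   TEXT NOT NULL,         -- rule expression in the rule language
    source      TEXT NOT NULL          -- 'detected' or 'manual'
);
CREATE TABLE IF NOT EXISTS dq_check_run (
    run_id      INTEGER PRIMARY KEY,
    rule_id     INTEGER REFERENCES dq_rule(rule_id),
    checked_at  TEXT NOT NULL,         -- ISO timestamp of the execution
    violations  INTEGER NOT NULL       -- number of rows violating the rule
);
"""


def open_repository(path: str = "dq_repository.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_manual_rule(conn: sqlite3.Connection, column: str, predicate: str) -> None:
    # Manually created rules are stored alongside the detected ones.
    conn.execute(
        "INSERT INTO dq_rule (column_name, predicate, source) VALUES (?, ?, 'manual')",
        (column, predicate),
    )
    conn.commit()
```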
The permanent data quality monitoring component takes the rules from the central data quality repository and regularly checks them against the current database content. Deviations are logged and reported.
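A minimal sketch of this monitoring step, assuming the hypothetical repository schema from the previous sketch and rules expressed as SQL conditions that valid rows must satisfy; scheduling (for example, a nightly job) and the reporting channel are omitted:

```python
# Minimal sketch of the permanent monitoring step: evaluate each stored rule
# against the current data and log deviations. Names are hypothetical.
import logging
import sqlite3
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dq_monitor")


def check_rules(conn: sqlite3.Connection, data_table: str) -> None:
    """Check every stored rule against the current database content."""
    rules = conn.execute("SELECT rule_id, predicate FROM dq_rule").fetchall()
    for rule_id, predicate in rules:
        # Count the rows that violate the rule, i.e. do NOT satisfy the
        # predicate. Predicates come from the trusted repository.
        query = f"SELECT COUNT(*) FROM {data_table} WHERE NOT ({predicate})"
        (violations,) = conn.execute(query).fetchone()
        conn.execute(
            "INSERT INTO dq_check_run (rule_id, checked_at, violations)"
            " VALUES (?, ?, ?)",
            (rule_id, datetime.now(timezone.utc).isoformat(), violations),
        )
        if violations:
            log.warning("rule %s violated by %d rows", rule_id, violations)
    conn.commit()
```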
More details can be found in the published papers (see publications) and in the concepts document.