

CleanEx

Explaining data cleaning pipelines


CLeanEX is a Python tool that generates explanations for the data cleaning pipelines selected by an automated cleaning system. Given a large search space of cleaning pipelines, CLeanEX proposes meaningful explanatory features to describe the pipelines and generates predicate-based explanation rules. It computes quality indicators for these explanations and proposes a multi-objective optimization algorithm to select the optimal set of explanations for user-defined objectives.

Our preliminary experiments show that multi-objective optimization is needed to generate high-quality explanations. These explanations can be either intrinsic to the single selected cleaning pipeline or relative to the other data cleaning and preprocessing pipelines that the automated cleaning system may not have selected.

We also show that CLeanEX is a promising step towards automatically generating insightful explanations while catering to the needs of the user.

For more details, please refer to our paper:

  • Laure Berti-Equille and Ugo Comignani. Explaining Automated Data Cleaning with CLeanEX. Proceedings of IJCAI-PRICAI 2020 Workshop on Explainable Artificial Intelligence (XAI). Preprint

Installation and Usage

1. Download or clone the CleanEx repository

2. If needed, please install the required packages

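Using pip, install numpy, scipy, scikit-criteria (imported as skcriteria, including skcriteria.madm), pandas, and matplotlib (which provides matplotlib.pyplot). The other modules used (functools, math, os, sys, csv) belong to the Python standard library and need no installation.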

3. Run CleanEx from your terminal with the following command line

python cleanex.py ./experiments/treeStruct.csv ./experiments/treeFeatures.csv "root" -o ./experiments/out.csv -f "n7" -m "1,0,0,0"

where:

  • two input files are required and placed in the ./experiments folder. They describe the full cleaning pipeline space, including all the alternative cleaning strategies explored by an automated data curation system: 1) treeStruct.csv stores the structure of the cleaning pipeline space represented as a tree, and 2) treeFeatures.csv describes each node (that is, each step of a cleaning strategy) with the following headers:
    • cost: The normalized cost of the cleaning strategy
    • data quality improvement (dq_imp): The percentage of data quality problems solved by the pipeline (e.g., 100% of missing values removed by imputation)
    • distortion (dist): The statistical distortion, measured as the Mahalanobis distance between the original and cleaned versions of the data set
    • satisfaction (sat): The satisfaction of ML model requirements by the pipeline defined as a Boolean: e.g., for regression, satisfaction equals 1 if linearity, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity constraints are satisfied by the cleaned data set
    • corr_ratio: The fraction of the number of pipelines sharing the same tasks over the sum of their respective ranks and the total number of explored pipelines, and
    • non_corr_ratio: The fraction of the number of pipelines that do not share the same tasks over the sum of their respective ranks and the total number of explored pipelines
  • the output file, specified with the -o option, is out.csv in the ./experiments folder
  • explanations are generated for the path from root to the leaf node n7 (the -f option specifies the final node)
  • the weights of the multi-objective optimization are given with the -m option: here, 1,0,0,0 indicates that polarity is the only criterion considered for optimization.
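To make the feature headers listed above concrete, here is a minimal sketch that checks a treeFeatures-style file for them. It is not taken from cleanex.py: the exact file layout is an assumption, in particular the hypothetical "node" identifier column, and the sample values are made up for illustration.

```python
import csv
import io

# Headers named in this README. The "node" identifier column is an assumption.
EXPECTED_HEADERS = ["cost", "dq_imp", "dist", "sat", "corr_ratio", "non_corr_ratio"]


def load_features(handle, id_column="node"):
    """Return {node_id: {header: float}} after checking the expected headers."""
    reader = csv.DictReader(handle)
    present = reader.fieldnames or []
    missing = [h for h in [id_column] + EXPECTED_HEADERS if h not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return {
        row[id_column]: {h: float(row[h]) for h in EXPECTED_HEADERS}
        for row in reader
    }


# Made-up values for illustration only; not taken from the repository.
SAMPLE = """node,cost,dq_imp,dist,sat,corr_ratio,non_corr_ratio
root,0.10,0.90,0.25,1,0.60,0.05
n7,0.43,0.95,0.30,0,0.40,0.20
"""

features = load_features(io.StringIO(SAMPLE))
print(features["n7"])
```

To try it on a real file, pass an open file handle instead of the in-memory sample and adjust id_column to whatever the file actually uses.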

CleanEx considers four quality dimensions for the generated explanations: polarity, distancing, surprise, and diversity (see our paper for a formal definition of each dimension).
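As an illustration of how a weight string such as 1,0,0,0 can steer the selection, the sketch below maps the weights onto the four dimensions and ranks explanations by a plain weighted sum. Several things here are assumptions: the README only confirms that the first weight is polarity, so the order of the remaining three is assumed to follow the order in which the dimensions are listed; the weighted sum is a simple stand-in and not the algorithm from the paper; and the explanation ids and scores are hypothetical.

```python
# Assumed order: only the first position (polarity) is confirmed by the README.
DIMENSIONS = ("polarity", "distancing", "surprise", "diversity")


def parse_weights(spec):
    """Turn a string such as "1,0,0,0" into {dimension: weight}."""
    values = [float(v) for v in spec.split(",")]
    if len(values) != len(DIMENSIONS):
        raise ValueError(f"expected {len(DIMENSIONS)} weights, got {len(values)}")
    return dict(zip(DIMENSIONS, values))


def rank_by_weighted_sum(scores, weights):
    """Order explanation ids by decreasing weighted sum of their quality scores."""
    def total(explanation_id):
        return sum(weights[d] * scores[explanation_id][d] for d in DIMENSIONS)
    return sorted(scores, key=total, reverse=True)


# Hypothetical quality scores for three hypothetical explanations.
scores = {
    "e1": {"polarity": 0.2, "distancing": 0.9, "surprise": 0.4, "diversity": 0.5},
    "e2": {"polarity": 0.8, "distancing": 0.1, "surprise": 0.3, "diversity": 0.2},
    "e3": {"polarity": 0.6, "distancing": 0.5, "surprise": 0.9, "diversity": 0.7},
}

weights = parse_weights("1,0,0,0")
ranking = rank_by_weighted_sum(scores, weights)
print(ranking)
```

With all the weight on polarity, the ranking follows the polarity scores alone. Spreading the weights across dimensions changes the order, which is the trade-off the multi-objective setting is meant to expose.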

4. Check the output file

Finally, CleanEx selects the optimal explanations for the specified data cleaning strategy (n7) and stores them in the output file out.csv. The resulting explanation rules look like the following:

P6: succ(root,n7)
C58: increase(cost,root,n7) /\ delta(cost,root,n7,0.332)
C71: most([corr_ratio,dq_imp],root)
C81: least([non_corr_ratio],root)
C160: least([sat],n7)

In natural language, this means that the choice of the cleaning strategy from root to n7 (P6) is best explained by polarity (favouring the extreme explanations). More specifically: despite the increasing cost of the cleaning strategy (C58), the data quality improvement and the number of cleaning tasks shared with all the other pipelines are maximal (C71), while the number of uncommon tasks (C81) and the satisfaction (C160) are minimal.
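For readers who want to post-process such rules, the following sketch parses rule strings of the form shown above into an identifier and a list of predicates. It relies only on the textual shape of the example rules (an id, a colon, and predicates joined by /\); the actual layout of out.csv is not described here, so reading the rules out of that file is left aside. The function names are hypothetical.

```python
import re

# A predicate is a name followed by a parenthesised argument list.
PREDICATE = re.compile(r"^(\w+)\((.*)\)$")


def split_args(text):
    """Split on commas that are not inside square brackets."""
    args, depth, current = [], 0, ""
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            args.append(current.strip())
            current = ""
        else:
            current += ch
    args.append(current.strip())
    return args


def parse_rule(line):
    """Return (rule_id, [(predicate_name, [arguments]), ...])."""
    rule_id, body = line.split(":", 1)
    predicates = []
    for part in body.split("/\\"):  # predicates are joined by the /\ conjunction
        match = PREDICATE.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognised predicate: {part!r}")
        name, raw_args = match.groups()
        predicates.append((name, split_args(raw_args)))
    return rule_id.strip(), predicates


rule = parse_rule(r"C58: increase(cost,root,n7) /\ delta(cost,root,n7,0.332)")
print(rule)
```

Bracketed lists such as [corr_ratio,dq_imp] are kept as a single argument, so most([corr_ratio,dq_imp],root) yields two arguments and not three.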

5. Visualization

You can also visualize the explanation quality as radar plots.



Licence

CleanEx and the content of this repository are licensed under the BSD 2-Clause License.
