desbordante's Introduction

General

Desbordante is a high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms. The currently supported data patterns are:

  • Functional dependencies, both exact and approximate (discovery and validation)
  • Conditional functional dependencies (discovery)
  • Metric functional dependencies (validation)
  • Fuzzy algebraic constraints (discovery)
  • Association rules (discovery)
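To make the first item concrete, here is a minimal pure-Python sketch of what validating an exact or approximate functional dependency means. It is not Desbordante's implementation: the helper names `fd_error` and `fd_holds` and the toy rows are hypothetical. The error used here is the fraction of rows that would have to be removed for the dependency to hold, which is one common definition; the measure used by a particular Desbordante algorithm may differ.

```python
from collections import Counter, defaultdict


def fd_error(rows, lhs, rhs):
    """Fraction of rows to drop so that lhs -> rhs holds (illustrative measure)."""
    if not rows:
        return 0.0
    # For every combination of LHS values, count how often each RHS value occurs.
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[i] for i in lhs)
        groups[key][row[rhs]] += 1
    # Keeping only the most frequent RHS value in each group makes the FD hold.
    kept = sum(max(counter.values()) for counter in groups.values())
    return (len(rows) - kept) / len(rows)


def fd_holds(rows, lhs, rhs):
    """An exact FD is the special case with zero error."""
    return fd_error(rows, lhs, rhs) == 0.0


# Hypothetical toy table: (course, room, professor).
rows = [
    ("Math", "101", "Smith"),
    ("Math", "101", "Smith"),
    ("Physics", "202", "Jones"),
    ("Physics", "203", "Jones"),
]
print(fd_holds(rows, [0], 2))  # course -> professor: True
print(fd_error(rows, [0], 1))  # course -> room: 0.25
```

With a threshold of 0.1, as in the approximate-dependency examples in this document, `course -> room` would be rejected under this measure because its error is 0.25.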

The discovered patterns can have many uses:

  • For scientific data, especially data obtained experimentally, an interesting pattern makes it possible to formulate a hypothesis that could lead to a scientific discovery. In some cases, if there is enough data, conclusions can even be drawn immediately. At the very least, the found pattern can provide a direction for further study.
  • For business data it is also possible to obtain a hypothesis based on found patterns. However, there are more down-to-earth and more in-demand applications in this case: clearing errors in data, finding and removing inexact duplicates, performing schema matching, and many more.
  • For training data used in machine learning applications, the found patterns can help in feature engineering and in choosing the direction for an ablation study.
  • For database data, found patterns can help with defining (recovering) primary and foreign keys and with setting up (checking) all kinds of integrity constraints.
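As an illustration of the error-cleaning use mentioned above, the following sketch flags rows that disagree with the majority value inside a group defined by the left-hand side of a dependency. This is only a hypothetical, simplified heuristic and not one of the Desbordante scenarios; the function name `suspicious_rows` and the sample data are invented for the example.

```python
from collections import Counter, defaultdict


def suspicious_rows(rows, lhs, rhs):
    """Return indices of rows whose rhs value differs from the most frequent
    rhs value among rows that share the same lhs values."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[tuple(row[i] for i in lhs)].append(index)

    flagged = []
    for indices in groups.values():
        counts = Counter(rows[i][rhs] for i in indices)
        # Ties are resolved in favour of the value seen first.
        majority, _ = counts.most_common(1)[0]
        flagged.extend(i for i in indices if rows[i][rhs] != majority)
    return sorted(flagged)


# Hypothetical data: a zip code is expected to determine the city.
rows = [
    ("10001", "New York"),
    ("10001", "New York"),
    ("10001", "New Yrok"),
    ("94105", "San Francisco"),
]
print(suspicious_rows(rows, [0], 1))  # [2]
```

A flagged row is only a candidate for review: the majority value is not guaranteed to be the correct one.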

Desbordante can be used via three interfaces:

  • Console application. This is a classic command-line interface that aims to provide basic profiling functionality, i.e. discovery and validation of patterns. A user can specify pattern type, task type, algorithm, input file(s) and output results to the screen or into a file.
  • Python bindings. Desbordante functionality can be accessed from within Python programs by employing the Desbordante Python library. This interface offers everything that is currently provided by the console version and allows advanced use, such as building interactive applications and designing scenarios for solving a particular real-life task. Relational data processing algorithms accept pandas DataFrames as input, allowing the user to conveniently preprocess the data before mining patterns.
  • Web application. There is a web application that provides discovery and validation tasks with a rich interactive interface where results can be conveniently visualized. However, currently it supports a limited number of patterns and should be considered more as an interactive demo.

A brief introduction to the tool and its use cases is presented here (in English) and here (in Russian). A list of various articles and guides can be found here.

Console

Usage examples:

  1. Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default FD discovery algorithm (HyFD) is used.
python3 cli.py --task=fd --table=../examples/datasets/university_fd.csv , True
( 1 3 ) -> 0
( 1 3 ) -> 2
( 0 ) -> 2
( 0 3 ) -> 1
( 2 ) -> 0
( 2 3 ) -> 1
  2. Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default AFD discovery algorithm (Pyro) is used.
python3 cli.py --task=afd --table=../examples/datasets/inventory_afd.csv , True --error=0.1
( 0 ) -> 1
( 0 ) -> 2
( 1 ) -> 2
  3. Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
python3 cli.py --task=mfd_verification --table=../examples/datasets/theatres_mfd.csv , True --lhs_indices=0 --rhs_indices=2 --metric=euclidean --parameter=5
True
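
The check in the third example can be pictured with a short pure-Python sketch. It assumes the usual reading of a metric functional dependency: within every group of rows that agree on the left-hand side, no two right-hand-side values may be further apart than the given parameter. It handles a single numeric right-hand-side column only, it is not the BRUTE algorithm, and the function name `mfd_holds` and the sample rows are hypothetical (they are not taken from theatres_mfd.csv).

```python
from collections import defaultdict


def mfd_holds(rows, lhs, rhs, parameter):
    """Check lhs -> rhs as a metric FD for one numeric rhs column."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in lhs)].append(float(row[rhs]))
    # For one numeric column the largest pairwise distance in a group
    # is simply max - min.
    return all(max(values) - min(values) <= parameter
               for values in groups.values())


# Hypothetical rows: (title, theatre, duration in minutes).
rows = [
    ("Film A", "Theatre 1", 120),
    ("Film A", "Theatre 2", 123),
    ("Film B", "Theatre 1", 95),
    ("Film B", "Theatre 3", 103),
]
print(mfd_holds(rows, [0], 2, 5))   # False: 103 - 95 = 8 > 5
print(mfd_holds(rows, [0], 2, 10))  # True
```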

For more information consult documentation and help files.

Python bindings

Desbordante features can be accessed from within Python programs by employing the Desbordante Python library. The library is implemented in the form of Python bindings to the interface of the Desbordante C++ core library, using pybind11. Apart from discovery and validation of patterns, this interface is capable of providing valuable additional information which can, for example, describe why a given pattern does not hold. All this allows end users to solve various data quality problems by constructing ad-hoc Python programs. To show the power of this interface, we have implemented several demo scenarios:

  1. Typo detection
  2. Data deduplication
  3. Anomaly detection

There is also an interactive demo for all of them, and all of these Python scripts are here. The ideas behind them are briefly discussed in this preprint (Section 3).

Simple usage examples:

  1. Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the FD discovery algorithm HyFD is used.
import desbordante

TABLE = '../examples/datasets/university_fd.csv'

algo = desbordante.HyFD()
algo.set_option('table', (TABLE, ',', True))
algo.set_option('is_null_equal_null')
algo.load_data()
algo.execute()
result = algo.get_fds()
print('FDs:')
for fd in result:
    print(fd)
FDs:
( 1 3 ) -> 0
( 1 3 ) -> 2
( 0 ) -> 2
( 0 3 ) -> 1
( 2 ) -> 0
( 2 3 ) -> 1
  2. Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the AFD discovery algorithm Pyro is used.
import desbordante

TABLE = '../examples/datasets/inventory_afd.csv'
ERROR = 0.1

algo = desbordante.Pyro()
algo.set_option('table', (TABLE, ',', True))
algo.set_option('is_null_equal_null')
algo.load_data()
algo.set_option('error', ERROR)
algo.set_option('threads')
algo.set_option('max_lhs')
algo.set_option('seed')
algo.execute()
result = algo.get_fds()
print('AFDs:')
for fd in result:
    print(fd)
AFDs:
( 0 ) -> 1
( 0 ) -> 2
( 1 ) -> 2
  3. Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
import desbordante

TABLE = '../examples/datasets/theatres_mfd.csv'
METRIC = 'euclidean'
LHS_INDICES = [0]
RHS_INDICES = [2]
PARAMETER = 5

algo = desbordante.MetricVerifier()
algo.set_option('table', (TABLE, ',', True))
algo.set_option('is_null_equal_null')
algo.load_data()
algo.set_option('lhs_indices', LHS_INDICES)
algo.set_option('metric', METRIC)
algo.set_option('parameter', PARAMETER)
algo.set_option('dist_from_null_is_infinity')
algo.set_option('rhs_indices', RHS_INDICES)
algo.execute()
if algo.mfd_holds():
    print('MFD holds')
else:
    print('MFD does not hold')
MFD holds
  4. Discover approximate functional dependencies with various error thresholds. Here, we showcase the preferred approach to configuring algorithm options. Furthermore, we are using a pandas DataFrame to load data from a CSV file.
>>> import desbordante
>>> import pandas as pd
>>> pyro = desbordante.Pyro()
>>> df = pd.read_csv('iris.csv', sep=',', header=0)
>>> pyro.load_data(df)
>>> pyro.execute(error=0.0)
>>> pyro.get_fds()
[( 0 1 2 ) -> 4, ( 0 2 3 ) -> 4, ( 0 1 3 ) -> 4, ( 1 2 3 ) -> 4]
>>> pyro.execute(error=0.1)
>>> pyro.get_fds()
[( 2 ) -> 0, ( 2 ) -> 1, ( 0 ) -> 2, ( 2 ) -> 4, ( 2 ) -> 3, ( 3 ) -> 2, ( 3 ) -> 0, ( 0 ) -> 1, ( 0 ) -> 3, ( 1 ) -> 0, ( 1 ) -> 2, ( 3 ) -> 4, ( 3 ) -> 1, ( 1 ) -> 3, ( 0 ) -> 4, ( 1 ) -> 4]
>>> pyro.execute(error=0.2)
>>> pyro.get_fds()
[( 2 ) -> 1, ( 2 ) -> 0, ( 2 ) -> 4, ( 0 ) -> 2, ( 2 ) -> 3, ( 0 ) -> 1, ( 3 ) -> 4, ( 3 ) -> 2, ( 3 ) -> 1, ( 3 ) -> 0, ( 1 ) -> 2, ( 0 ) -> 3, ( 0 ) -> 4, ( 1 ) -> 0, ( 1 ) -> 4, ( 1 ) -> 3]
>>> pyro.execute(error=0.3)
>>> pyro.get_fds()
[( 2 ) -> 1, ( 0 ) -> 2, ( 2 ) -> 0, ( 3 ) -> 0, ( 2 ) -> 3, ( 1 ) -> 0, ( 2 ) -> 4, ( 3 ) -> 2, ( 0 ) -> 1, ( 1 ) -> 2, ( 3 ) -> 1, ( 3 ) -> 4, ( 0 ) -> 3, ( 4 ) -> 2, ( 4 ) -> 1, ( 0 ) -> 4, ( 1 ) -> 3, ( 1 ) -> 4, ( 4 ) -> 3]

Web interface

While the Python interface makes building interactive applications possible, Desbordante also offers a web interface which is aimed specifically at interactive tasks. Such tasks typically involve multiple steps and require substantial user input at each of them. Interactive tasks usually originate from Python scenarios, i.e. we select the most interesting ones and implement them in the web version. Currently, only the typo detection scenario is implemented. The web interface is also useful for pattern discovery and validation tasks: a user may specify parameters, browse results, employ advanced visualizations and filters, all in a convenient way.

You can try the deployed web version here. You have to register in order to process your own datasets. Keep in mind that due to high demand various time and memory limits are enforced: processing is aborted if they are exceeded. The source code of the web interface is kept in a separate repo.

Build instructions

Ubuntu

The following instructions were tested on Ubuntu 20.04+ LTS.

Dependencies

Prior to cloning the repository and attempting to build the project, ensure that you have the following software:

  • GNU g++ compiler, version 10+
  • CMake, version 3.13+
  • Boost library, version 1.74.0+

To use test datasets you will need:

  • Git Large File Storage, version 3.0.2+

Building the project (first option: with tests)

First, navigate to a desired directory. Then clone the repository, cd into the project directory, pull the test datasets and launch the build script:

git clone https://github.com/Mstrutov/Desbordante/
cd Desbordante
./pull_datasets.sh
./build.sh

Building the project (second option: without tests)

First, navigate to a desired directory. Then clone the repository, cd into the project directory and launch the build script:

git clone https://github.com/Mstrutov/Desbordante/
cd Desbordante
./build.sh --no-tests --no-unpack

Launching the binaries

The script generates the following file structure in /path/to/Desbordante/build/target:

├───input_data
│   └───some-sample-csv's.csv
├───Desbordante_test
├───Desbordante_run

The input_data directory contains several .csv files that may be used by Desbordante_test. Run Desbordante_test to perform unit testing:

cd build/target
./Desbordante_test

The tool itself can be run as follows:

./Desbordante_run --algo=tane --data=<path_to_dataset>

Cite

If you use this software for research, please cite one of our papers:

  1. George Chernishev, et al. "Solving Data Quality Problems with Desbordante: a Demo". CoRR abs/2307.14935 (2023).
  2. George Chernishev, et al. "Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)". CoRR abs/2301.05965 (2023).
  3. M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.
  4. A. Smirnov, A. Chizhov, I. Shchuckin, N. Bobrov and G. Chernishev, "Fast Discovery of Inclusion Dependencies with Desbordante," 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 2023, pp. 264-275, doi: 10.23919/FRUCT58615.2023.10143047.

Contacts and Q&A

If you have any questions regarding the tool usage, you can ask them in our Google group. To contact the dev team, email George Chernishev, Maxim Strutovsky or Nikita Bobrov.
