
aurum-datadiscovery's People

Contributors

damienrrb, florents-tselai, jmftrindade, justinanderson, mansoure, michaeldh42, nato16, raulcf, rawatvimal, rogertangos, snowgy, suhailshergill, svdwoude, wangsibovictor, ygina, yinyanghu

aurum-datadiscovery's Issues

Bad data slows down profiler

De-noising the data will help overall performance by:

  • making the profiler work more efficiently
  • improving accuracy

This requires counting the errors that occur for a given column and abandoning the processing of that column once they exceed a given threshold:

For example, in data.gov I observe multiple messages like:
WARN preanalysis.PreAnalyzer - Error while parsing: For input string: "523986004252398600465239860072"

In any case, this requires a more in-depth study of what other errors are causing trouble.
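
A minimal sketch of the threshold idea, assuming a hypothetical per-column error counter (the class and method names below are illustrative, not part of the current ddprofiler code):

    // Hypothetical helper: count parse errors per column and tell the caller
    // when to abandon that column. Names are illustrative only.
    import java.util.HashMap;
    import java.util.Map;

    public class ColumnErrorTracker {

        private final int threshold;
        private final Map<String, Integer> errorsPerColumn = new HashMap<>();

        public ColumnErrorTracker(int threshold) {
            this.threshold = threshold;
        }

        // Record one parse error for the column; returns true once the column
        // should be abandoned.
        public boolean recordErrorAndCheck(String columnName) {
            int errors = errorsPerColumn.merge(columnName, 1, Integer::sum);
            return errors >= threshold;
        }
    }

The profiling loop would call recordErrorAndCheck whenever parsing a value fails (e.g. the NumberFormatException behind the warning above) and skip the rest of that column once it returns true.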

Unify data source types

At the moment there is an enum for all types of data sources and another one for db types. Make this consistent and simplify it.
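
For illustration, a unified enum might look roughly like this (all names below are hypothetical, not taken from the current code):

    // Hypothetical unified enum: one value per concrete source kind, replacing
    // the separate "data source type" and "db type" enums.
    public enum SourceType {
        CSV,
        POSTGRES,
        MYSQL,
        ORACLE;

        // True for sources read through a JDBC connection.
        public boolean isDatabase() {
            return this != CSV;
        }
    }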

Decouple indexing from profiling

EDIT: [Profiling is about 1 order of magnitude faster than indexing. Decouple both processes.]

Decouple the profiling process into the smallest pieces possible. For example, one should be able to index only schemas and no data, or data alone, etc. Then find a way of combining them back again into the original form. The challenge is to maintain all performance guarantees.

ddprofiler throws RemoteTransportException with elasticsearch v2.3.5

Noticed today while running ddprofiler against my local installation of elasticsearch 2.3.5. The ddprofiler server started throwing the following exception:

org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.liveness.LivenessResponse]]

This only happens after commit 0ee4a9a - Merge algebra into master. I'm guessing this is because of the transport changes in the store client?

Workaround: I downgraded to elasticsearch v2.3.0 as per requirements.txt, and then it works fine.

Better metrics for profiler

So that we understand what errors occur:

  • when processing a source
  • when processing a record

and we can report them when the job is done.
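
A minimal sketch of such counters, with hypothetical names (the real profiler may already have a natural place for this, e.g. the Conductor):

    // Hypothetical metrics holder: per-source and per-record error counts,
    // reported once the profiling job finishes.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class ProfilerErrorMetrics {

        private final Map<String, AtomicLong> sourceErrors = new ConcurrentHashMap<>();
        private final AtomicLong recordErrors = new AtomicLong();

        public void sourceError(String sourceName) {
            sourceErrors.computeIfAbsent(sourceName, k -> new AtomicLong()).incrementAndGet();
        }

        public void recordError() {
            recordErrors.incrementAndGet();
        }

        // Called when the job is done.
        public void report() {
            System.out.println("Record-level errors: " + recordErrors.get());
            sourceErrors.forEach((source, count) ->
                    System.out.println("Errors in source " + source + ": " + count.get()));
        }
    }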

Early cut of entityAnalyzer

Our entity analyzer task consists of understanding the entities contained in a set of values. As these values are all supposed to "mean" the same thing, we should stop analyzing entities (which is costly) as soon as we are confident we have discovered the underlying entities.

Inefficient data reads

There is far too much object generation without good reason when reading records. Streamline the implementation for CSV files. Write a performance test to track this in the future.

Add standard deviation to RangeAnalyzer

It is straightforward to compute this in a streaming fashion. Add the attribute to the Analyzer class, compute it, and return it in the NumericalRange object.
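
For reference, the standard one-pass (Welford) update; the attribute and class names in the profiler may differ, so this only sketches the computation itself:

    // Streaming mean / standard deviation via Welford's algorithm, the kind of
    // state the RangeAnalyzer would keep alongside min/max.
    public class StreamingStdDev {

        private long count = 0;
        private double mean = 0.0;
        private double m2 = 0.0; // sum of squared deviations from the running mean

        public void feed(double value) {
            count++;
            double delta = value - mean;
            mean += delta / count;
            m2 += delta * (value - mean);
        }

        public double stdDev() {
            // sample standard deviation; returns 0 until there are 2+ values
            return count > 1 ? Math.sqrt(m2 / (count - 1)) : 0.0;
        }
    }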

ProfilerConfig property to choose Store

There are three store implementations right now that the profiler can choose from. Let's create a new property to select which one to use and pass that property to StoreFactory to do the rest.
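
A sketch of how such a property could flow into StoreFactory; the property key "store.type" and the store names are hypothetical placeholders:

    // Hypothetical wiring: read a "store.type" property and let StoreFactory
    // pick the implementation. Property name and store kinds are illustrative.
    import java.util.Properties;

    public class StoreSelectionExample {

        enum StoreType { NULL, ELASTIC_HTTP, ELASTIC_NATIVE }

        static StoreType chosenStore(Properties config) {
            String value = config.getProperty("store.type", "ELASTIC_NATIVE");
            return StoreType.valueOf(value.toUpperCase());
        }

        public static void main(String[] args) {
            Properties config = new Properties();
            config.setProperty("store.type", "elastic_http");
            // StoreFactory would receive this value and build the matching store.
            System.out.println("Selected store: " + chosenStore(config));
        }
    }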

Prepare e2e deployment of profiler

Configure the store and properties through the command line (make sure they are compatible) and possibly package the profiler into a fat jar for easy deployment.

Accuracy of approximate summaries

Quantiles and cardinalities depend on approximations at the moment. We need some infrastructure to measure the accuracy of these methods, compare them, and decide how to tune them for a better accuracy/performance trade-off.

Fix naming issues with connectors

For example:

rec.getTuples().add(v1)

v1, however, is not a tuple, but one value of one column.
Identify the other cases and fix them too.

Preload all entity models from OpenNLP

The method EntityAnalyzer.loadModel() should instead return a list of TokenNameFinderModel objects, each corresponding to a different entity model. These models should then be applied in the feed() method so that all existing entities are detected.
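
A sketch of that shape with the OpenNLP API; the model file paths are placeholders and the surrounding EntityAnalyzer plumbing is assumed rather than copied from the repo:

    // Sketch: load several OpenNLP name-finder models once, then apply all of
    // them to each batch of tokens in feed(). Model paths are placeholders.
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class EntityModelsExample {

        public static List<TokenNameFinderModel> loadModels(List<String> modelPaths)
                throws IOException {
            List<TokenNameFinderModel> models = new ArrayList<>();
            for (String path : modelPaths) {
                try (FileInputStream in = new FileInputStream(path)) {
                    models.add(new TokenNameFinderModel(in));
                }
            }
            return models;
        }

        // What feed() would do: run every loaded model over the same tokens.
        public static void detectEntities(List<TokenNameFinderModel> models, String[] tokens) {
            for (TokenNameFinderModel model : models) {
                NameFinderME finder = new NameFinderME(model);
                for (Span span : finder.find(tokens)) {
                    System.out.println("Found entity of type: " + span.getType());
                }
            }
        }
    }

In practice the NameFinderME instances could be created once per model rather than per call.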

Enumerate noise-cleaning strategies

We cannot clean the data before profiling it due to cost-value arguments. However, we should make a best-effort attempt to remove noise (null values, "X" values that mean NO, evident outliers, etc.) to raise the quality of the profiled information.
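
A best-effort filter might look roughly like this; the specific markers below (e.g. treating a lone "X" or "-" as a null/NO value) are examples rather than a fixed policy:

    // Illustrative best-effort noise filter applied to a raw value before it
    // reaches the analyzers. The list of null markers is an example only.
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class NoiseFilterExample {

        private static final Set<String> NULL_MARKERS = new HashSet<>(
                Arrays.asList("", "null", "n/a", "na", "none", "x", "-"));

        // Returns true if the raw value should be skipped as noise.
        public static boolean isNoise(String rawValue) {
            return rawValue == null || NULL_MARKERS.contains(rawValue.trim().toLowerCase());
        }
    }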

Make profiler fault tolerant

When profiling a column fails, this should not break the profiler; it should only log the error and its cause to a file, skip the column, and keep working.
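
In outline, with the column type and profiling call standing in for the real per-column logic (assumptions, not the current code):

    // Sketch: guard each column so one failure is only logged and skipped,
    // instead of aborting the whole profiling run.
    import java.util.List;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class FaultTolerantProfilingExample {

        private static final Logger LOG =
                Logger.getLogger(FaultTolerantProfilingExample.class.getName());

        public static void profileAll(List<String> columnNames) {
            for (String column : columnNames) {
                try {
                    profileColumn(column); // stand-in for the real analyzers
                } catch (Exception e) {
                    LOG.log(Level.SEVERE, "Failed to profile column " + column, e);
                    // skip this column and keep profiling the rest
                }
            }
        }

        private static void profileColumn(String column) {
            // placeholder for the actual profiling work over the column's values
        }
    }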

Complete and test offline mode

The profiler can run in two modes. In online mode it starts a server (embedded jetty) that receives requests in the form of a path and the name of a source.

In offline mode, the user configures a path on the command line; Main then reads the files in that path and creates tasks that are submitted to the Conductor.

This issue consists of:
1- Make sure a user can configure a path to (CSV) files through the command line, and that these are added to ProfilerConfig correctly.
2- When reading files, get their name and path so that the WorkerTasks can be created properly. There are TODOs indicating where the missing info goes (lines 85, 86, 87 in Main.java).
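
A sketch of step 2, walking a configured path and extracting the (path, name) pair per CSV file; the WorkerTask/Conductor calls are only indicated in comments because their exact signatures live in the repo:

    // Sketch: list the CSV files under the configured path and derive the
    // source name and path from which a WorkerTask could be built.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class OfflineModeExample {

        public static void submitCsvTasks(String configuredPath) throws IOException {
            try (Stream<Path> files = Files.list(Paths.get(configuredPath))) {
                files.filter(p -> p.toString().endsWith(".csv"))
                     .forEach(p -> {
                         String sourceName = p.getFileName().toString();
                         String sourcePath = p.getParent().toString();
                         // Here the real code would build a WorkerTask from
                         // (sourcePath, sourceName) and submit it to the Conductor.
                         System.out.println("Would submit task for "
                                 + sourcePath + "/" + sourceName);
                     });
            }
        }
    }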

Adjust analyzer from elasticsearch

The analyzer can be configured with stemming and stop-word removal. We must understand all the options and adjust it for the best possible accuracy.

Implementing additional readRows in DBConnector

The Connector class now declares an additional abstract function beyond the original set:

"public abstract Map<Attribute, List> readRows(int num) throws IOException, SQLException;"

This abstract function is implemented for FileConnector, but not for DBConnector (empty method in line 208 of DBConnector).

This issue consists of implementing this method and testing it against a database.
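
A sketch of what a JDBC-backed implementation could look like; here the map key is a plain String column name instead of the Attribute type from the Connector API, and the query/limit handling is simplified:

    // Simplified JDBC readRows: fetch up to `num` rows from a table and group
    // the values per column name.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class ReadRowsExample {

        public static Map<String, List<String>> readRows(Connection conn, String table, int num)
                throws SQLException {
            Map<String, List<String>> columns = new LinkedHashMap<>();
            try (PreparedStatement ps = conn.prepareStatement("SELECT * FROM " + table)) {
                ps.setMaxRows(num); // stop after `num` rows
                try (ResultSet rs = ps.executeQuery()) {
                    ResultSetMetaData md = rs.getMetaData();
                    for (int i = 1; i <= md.getColumnCount(); i++) {
                        columns.put(md.getColumnName(i), new ArrayList<>());
                    }
                    while (rs.next()) {
                        for (int i = 1; i <= md.getColumnCount(); i++) {
                            columns.get(md.getColumnName(i)).add(rs.getString(i));
                        }
                    }
                }
            }
            return columns;
        }
    }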

Faster indexing

  • Check how to improve elasticsearch's performance
  • Build a pre-indexer that filters out data that has already been indexed for a given column. Basically this requires a count-min sketch per column, so that we can decide not to send certain data to the store if it has already been indexed (note that even though re-sending the data won't change the index, it still requires processing); see the sketch after this list.
  • throw more strategies here... (that don't involve building our own stuff, for now)
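
A sketch of the pre-indexer bullet above, assuming the stream-lib CountMinSketch (the same library family the profiler already uses for HyperLogLogPlus); the sketch parameters are illustrative:

    // Per-column pre-indexer idea: remember (approximately) which values were
    // already sent to the store and skip re-sending them. A count-min sketch
    // can only over-estimate, so occasionally a never-seen value may be
    // skipped; that is part of the accuracy/performance trade-off to evaluate.
    import com.clearspring.analytics.stream.frequency.CountMinSketch;

    public class PreIndexerExample {

        // epsilon, confidence and seed values below are illustrative
        private final CountMinSketch seen = new CountMinSketch(0.001, 0.99, 42);

        // Returns true if the value looks new and should be sent to the store.
        public boolean shouldIndex(String value) {
            boolean isNew = seen.estimateCount(value) == 0;
            if (isNew) {
                seen.add(value, 1);
            }
            return isNew;
        }
    }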

Unify keyword search

Search in content, schema name, table name, db name.
Then indicate the context of the results.

Exact keyword/schema search

When searching for schema names, sometimes it is useful to search for exact matches, rather than approximations. In certain cases, users know the exact name of the schema, and therefore it is more useful to do an exact search. Add some property to permit this.
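
One way to expose this with the Elasticsearch Java API is to switch between a term query and a match query based on the new property; the field name "schema_name" is a placeholder for the actual mapping, and an exact term query usually needs a not_analyzed (raw) variant of the field:

    // Sketch: exact (term) query versus analyzed (match) query, selected by
    // the proposed property. "schema_name" is a placeholder field name.
    import org.elasticsearch.index.query.QueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;

    public class SchemaSearchExample {

        public static QueryBuilder schemaQuery(String schemaName, boolean exact) {
            return exact
                    ? QueryBuilders.termQuery("schema_name", schemaName)   // exact match
                    : QueryBuilders.matchQuery("schema_name", schemaName); // approximate
        }
    }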

Parse input options

Parse input options, based on the options defined in ProfilerConfig, so that users can start the profiler with a set of parameters. Unit test.

Unit test DBConnector with Oracle db

Make sure the DBConnector code is compatible with as many dbs as possible. In particular test that the method:

"public List getAttributes() throws SQLException {"

in DBConnector works well.

In the short term, we need the code to work with an Oracle 10g database.

Specialized tokenizers for db data

Schemas have names like:

last_name
us_phone_number

Find a better tokenizer to cover these cases; this will boost accuracy a lot.
These new tokenizers would be part of Elasticsearch.

Wrong matching of floats with mixed commas and dots

I have created a "SanitizationTest" that contains a few class attributes with real data. In some cases those values have the form:

160,124.05 (note the first comma)

When this is part of a field in a CSV file, the format says that the value should be escaped with quotes. For example, if we have fields A, B, C, with B of the form shown previously, a line could look like:

"Boston", "160,124.05", 89

The current implementation does not parse the second field properly, which introduces errors that propagate through the entire prototype.
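
For reference, a quote-aware parser keeps the second field intact; here is a minimal check with Apache Commons CSV, used only to illustrate the expected behaviour rather than as the parser the profiler must adopt:

    // Minimal check that a quote-aware parser keeps "160,124.05" as one field.
    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    public class QuotedFieldExample {

        public static void main(String[] args) throws IOException {
            String line = "\"Boston\",\"160,124.05\",89";
            try (CSVParser parser = new CSVParser(new StringReader(line), CSVFormat.DEFAULT)) {
                for (CSVRecord record : parser) {
                    System.out.println(record.get(1)); // prints 160,124.05 as one value
                }
            }
        }
    }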

bugs in the CardinalityAnalyzer

The CardinalityAnalyzer uses HyperLogLogPlus to estimate the cardinality. However, the results are error prone. For example, I used the module to derive the cardinality of 500 records.

The results showed 531 unique elements while the total number of records is 500.

Probably we should use a better estimator.
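
A quick way to reproduce and measure this with the same library; the precision parameters passed to HyperLogLogPlus are the main knobs to tune, and the values below are illustrative:

    // Accuracy check: feed a known number of distinct values and compare the
    // estimate against the ground truth.
    import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

    public class CardinalityCheckExample {

        public static void main(String[] args) {
            HyperLogLogPlus hll = new HyperLogLogPlus(14, 25); // illustrative precision
            int trueCardinality = 500;
            for (int i = 0; i < trueCardinality; i++) {
                hll.offer("record-" + i);
            }
            System.out.println("True: " + trueCardinality
                    + ", estimated: " + hll.cardinality());
        }
    }

Comparing several precision settings this way would show whether the observed 531 vs 500 error comes from how the estimator is configured or fed, or from an inherent limit of the approximation.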

Parallelism granularity

Right now the parallelism granularity is the table, to avoid redundant reads.
By splitting the data in memory we could provide finer-grained parallelism, i.e. per column, while still avoiding redundant reads.

Improve source identification

Right now we have (source, field).
As we merge sources from different databases and repositories, it becomes useful to identify these too. So for example:
((database, source) , field)
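
A small sketch of the richer identifier as a value class (names are illustrative):

    // Hypothetical composite identifier: (database, source, field).
    import java.util.Objects;

    public final class FieldId {

        private final String database;
        private final String source;
        private final String field;

        public FieldId(String database, String source, String field) {
            this.database = database;
            this.source = source;
            this.field = field;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FieldId)) {
                return false;
            }
            FieldId other = (FieldId) o;
            return database.equals(other.database)
                    && source.equals(other.source)
                    && field.equals(other.field);
        }

        @Override
        public int hashCode() {
            return Objects.hash(database, source, field);
        }

        @Override
        public String toString() {
            return "((" + database + ", " + source + "), " + field + ")";
        }
    }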
