
aurum-datadiscovery's People

Contributors

damienrrb, florents-tselai, jmftrindade, justinanderson, mansoure, michaeldh42, nato16, raulcf, rawatvimal, rogertangos, snowgy, suhailshergill, svdwoude, wangsibovictor, ygina, yinyanghu

aurum-datadiscovery's Issues

Bad data slows down profiler

De-noising the data will help overall performance by:

  • making the profiler work more efficiently
  • improving accuracy

This requires counting the errors that occur for a given column and abandoning the processing of that column once they exceed a given threshold:

For example, in data.gov I observe multiple messages like:
WARN preanalysis.PreAnalyzer - Error while parsing: For input string: "523986004252398600465239860072"

In any case, this requires a more in-depth study of what other errors are causing trouble.
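
A minimal sketch of the threshold idea, assuming a hypothetical per-column error counter (the class and method names below are illustrative, not part of the current ddprofiler code):

    // Hypothetical helper: count parse errors per column and tell the caller
    // when to abandon that column. Names are illustrative only.
    import java.util.HashMap;
    import java.util.Map;

    public class ColumnErrorTracker {

        private final int threshold;
        private final Map<String, Integer> errorsPerColumn = new HashMap<>();

        public ColumnErrorTracker(int threshold) {
            this.threshold = threshold;
        }

        // Record one parse error for the column; returns true once the column
        // should be abandoned.
        public boolean recordErrorAndCheck(String columnName) {
            int errors = errorsPerColumn.merge(columnName, 1, Integer::sum);
            return errors >= threshold;
        }
    }

The profiling loop would call recordErrorAndCheck whenever parsing a value fails (e.g. the NumberFormatException behind the warning above) and skip the rest of that column once it returns true.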

Unify data source types

At the moment there is an enum for all types of data sources and another one for db types. Make this consistent and simplify it.
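
For illustration, a unified enum might look roughly like this (all names below are hypothetical, not taken from the current code):

    // Hypothetical unified enum: one value per concrete source kind, replacing
    // the separate "data source type" and "db type" enums.
    public enum SourceType {
        CSV,
        POSTGRES,
        MYSQL,
        ORACLE;

        // True for sources read through a JDBC connection.
        public boolean isDatabase() {
            return this != CSV;
        }
    }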

Decouple indexing from profiling

EDIT: [Profiling is about 1 order of magnitude faster than indexing. Decouple both processes.]

Decouple the profiling process into the smallest pieces possible. For example, one should be able to index only schemas and no data, or data alone, etc. Then find a way of combining them back again into the original form. The challenge is to maintain all performance guarantees.

ddprofiler throws RemoteTransportException with elasticsearch v2.3.5

Noticed today while running ddprofiler against my local installation of elasticsearch 2.3.5. The ddprofiler server started throwing the following exception:

org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.liveness.LivenessResponse]]

This only happens after commit 0ee4a9a - Merge algebra into master. I'm guessing this is because of the transport changes in the store client?

Workaround: I downgraded to elasticsearch v2.3.0 as per requirements.txt, and then it works fine.

Better metrics for profiler

So that we understand what errors occur:

  • when processing a source
  • when processing a record

and we can report them when the job is done.
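
A minimal sketch of such counters, with hypothetical names (the real profiler may already have a natural place for this, e.g. the Conductor):

    // Hypothetical metrics holder: per-source and per-record error counts,
    // reported once the profiling job finishes.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class ProfilerErrorMetrics {

        private final Map<String, AtomicLong> sourceErrors = new ConcurrentHashMap<>();
        private final AtomicLong recordErrors = new AtomicLong();

        public void sourceError(String sourceName) {
            sourceErrors.computeIfAbsent(sourceName, k -> new AtomicLong()).incrementAndGet();
        }

        public void recordError() {
            recordErrors.incrementAndGet();
        }

        // Called when the job is done.
        public void report() {
            System.out.println("Record-level errors: " + recordErrors.get());
            sourceErrors.forEach((source, count) ->
                    System.out.println("Errors in source " + source + ": " + count.get()));
        }
    }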

Early cut of entityAnalyzer

Our entity analyzer task consists of understanding the entities contained in a set of values. As these values are all supposed to "mean" the same thing, we should stop analyzing entities (which is costly) as soon as we are confident we have discovered the underlying entities.

Inefficient data reads

There is far too much object generation without good reason when reading records. Streamline the implementation for CSV files. Write a performance test to track this in the future.

Add standard deviation to RangeAnalyzer

It is straightforward to compute this in a streaming fashion. Add the attribute to the Analyzer class, compute it, and return it in the NumericalRange object.
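
For reference, the standard one-pass (Welford) update; the attribute and class names in the profiler may differ, so this only sketches the computation itself:

    // Streaming mean / standard deviation via Welford's algorithm, the kind of
    // state the RangeAnalyzer would keep alongside min/max.
    public class StreamingStdDev {

        private long count = 0;
        private double mean = 0.0;
        private double m2 = 0.0; // sum of squared deviations from the running mean

        public void feed(double value) {
            count++;
            double delta = value - mean;
            mean += delta / count;
            m2 += delta * (value - mean);
        }

        public double stdDev() {
            // sample standard deviation; returns 0 until there are 2+ values
            return count > 1 ? Math.sqrt(m2 / (count - 1)) : 0.0;
        }
    }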

ProfilerConfig property to choose Store

There are three store implementations right now that the profiler can choose from. Let's create a new property to select which one to use and pass that property to StoreFactory to do the rest.
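
A sketch of how such a property could flow into StoreFactory; the property key "store.type" and the store names are hypothetical placeholders:

    // Hypothetical wiring: read a "store.type" property and let StoreFactory
    // pick the implementation. Property name and store kinds are illustrative.
    import java.util.Properties;

    public class StoreSelectionExample {

        enum StoreType { NULL, ELASTIC_HTTP, ELASTIC_NATIVE }

        static StoreType chosenStore(Properties config) {
            String value = config.getProperty("store.type", "ELASTIC_NATIVE");
            return StoreType.valueOf(value.toUpperCase());
        }

        public static void main(String[] args) {
            Properties config = new Properties();
            config.setProperty("store.type", "elastic_http");
            // StoreFactory would receive this value and build the matching store.
            System.out.println("Selected store: " + chosenStore(config));
        }
    }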

Prepare e2e deployment of profiler

Configure the store and properties through the command line (make sure they are compatible) and possibly package the profiler into a fat jar for easy deployment.

Accuracy of approximate summaries

Quantiles and cardinalities depend on approximations at the moment. We need some infrastructure to measure the accuracy of these methods, compare them, and decide how to tune them for a better accuracy/performance trade-off.

Fix naming issues with connectors

For example:

rec.getTuples().add(v1)

v1, however, is not a tuple, but one value of one column.
Identify the other cases and fix them too.

Preload all entity models from OpenNLP

The method EntityAnalyzer.loadModel() should instead return a list of TokenNameFinderModel objects, each corresponding to a different entity model. These models should then be applied in the feed() method so that all existing entities are detected.
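
A sketch of that shape with the OpenNLP API; the model file paths are placeholders and the surrounding EntityAnalyzer plumbing is assumed rather than copied from the repo:

    // Sketch: load several OpenNLP name-finder models once, then apply all of
    // them to each batch of tokens in feed(). Model paths are placeholders.
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class EntityModelsExample {

        public static List<TokenNameFinderModel> loadModels(List<String> modelPaths)
                throws IOException {
            List<TokenNameFinderModel> models = new ArrayList<>();
            for (String path : modelPaths) {
                try (FileInputStream in = new FileInputStream(path)) {
                    models.add(new TokenNameFinderModel(in));
                }
            }
            return models;
        }

        // What feed() would do: run every loaded model over the same tokens.
        public static void detectEntities(List<TokenNameFinderModel> models, String[] tokens) {
            for (TokenNameFinderModel model : models) {
                NameFinderME finder = new NameFinderME(model);
                for (Span span : finder.find(tokens)) {
                    System.out.println("Found entity of type: " + span.getType());
                }
            }
        }
    }

In practice the NameFinderME instances could be created once per model rather than per call.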

Enumerate noise-cleaning strategies

We cannot clean the data before profiling it due to cost-value arguments. However, we should make a best-effort attempt to remove noise (null values, "X" values that mean NO, evident outliers, etc.) to raise the quality of the profiled information.
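
A best-effort filter might look roughly like this; the specific markers below (e.g. treating a lone "X" or "-" as a null/NO value) are examples rather than a fixed policy:

    // Illustrative best-effort noise filter applied to a raw value before it
    // reaches the analyzers. The list of null markers is an example only.
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class NoiseFilterExample {

        private static final Set<String> NULL_MARKERS = new HashSet<>(
                Arrays.asList("", "null", "n/a", "na", "none", "x", "-"));

        // Returns true if the raw value should be skipped as noise.
        public static boolean isNoise(String rawValue) {
            return rawValue == null || NULL_MARKERS.contains(rawValue.trim().toLowerCase());
        }
    }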

Make profiler fault tolerant

When profiling a column fails, this should not break the profiler; it should only log the error and its cause to a file, skip the column, and keep working.
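
In outline, with the column type and profiling call standing in for the real per-column logic (assumptions, not the current code):

    // Sketch: guard each column so one failure is only logged and skipped,
    // instead of aborting the whole profiling run.
    import java.util.List;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class FaultTolerantProfilingExample {

        private static final Logger LOG =
                Logger.getLogger(FaultTolerantProfilingExample.class.getName());

        public static void profileAll(List<String> columnNames) {
            for (String column : columnNames) {
                try {
                    profileColumn(column); // stand-in for the real analyzers
                } catch (Exception e) {
                    LOG.log(Level.SEVERE, "Failed to profile column " + column, e);
                    // skip this column and keep profiling the rest
                }
            }
        }

        private static void profileColumn(String column) {
            // placeholder for the actual profiling work over the column's values
        }
    }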

Complete and test offline mode

The profiler can run in two modes. In online mode it starts a server (embedded jetty) that receives requests in the form of a path and the name of a source.

In offline mode, the user configures a path on the command line; Main then reads the files in that path and creates tasks that are submitted to the Conductor.

This issue consists of:
1- Make sure a user can configure a path to (CSV) files through the command line, and that these are added to ProfilerConfig correctly.
2- When reading files, get their name and path so that the WorkerTasks can be created properly. There are TODOs indicating where the missing info goes (lines 85, 86, 87 in Main.java).
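
A sketch of step 2, walking a configured path and extracting the (path, name) pair per CSV file; the WorkerTask/Conductor calls are only indicated in comments because their exact signatures live in the repo:

    // Sketch: list the CSV files under the configured path and derive the
    // source name and path from which a WorkerTask could be built.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class OfflineModeExample {

        public static void submitCsvTasks(String configuredPath) throws IOException {
            try (Stream<Path> files = Files.list(Paths.get(configuredPath))) {
                files.filter(p -> p.toString().endsWith(".csv"))
                     .forEach(p -> {
                         String sourceName = p.getFileName().toString();
                         String sourcePath = p.getParent().toString();
                         // Here the real code would build a WorkerTask from
                         // (sourcePath, sourceName) and submit it to the Conductor.
                         System.out.println("Would submit task for "
                                 + sourcePath + "/" + sourceName);
                     });
            }
        }
    }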

Adjust analyzer from elasticsearch

The analyzer can be configured with stemming and stop-word removal. We must understand all the options and adjust it for the best possible accuracy.

Implementing additional readRows in DBConnector

The Connector class now declares an additional abstract function beyond the original set:

"public abstract Map<Attribute, List> readRows(int num) throws IOException, SQLException;"

This abstract function is implemented for FileConnector, but not for DBConnector (empty method in line 208 of DBConnector).

This issue consists of implementing this method and testing it against a database.
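
A sketch of what a JDBC-backed implementation could look like; here the map key is a plain String column name instead of the Attribute type from the Connector API, and the query/limit handling is simplified:

    // Simplified JDBC readRows: fetch up to `num` rows from a table and group
    // the values per column name.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class ReadRowsExample {

        public static Map<String, List<String>> readRows(Connection conn, String table, int num)
                throws SQLException {
            Map<String, List<String>> columns = new LinkedHashMap<>();
            try (PreparedStatement ps = conn.prepareStatement("SELECT * FROM " + table)) {
                ps.setMaxRows(num); // stop after `num` rows
                try (ResultSet rs = ps.executeQuery()) {
                    ResultSetMetaData md = rs.getMetaData();
                    for (int i = 1; i <= md.getColumnCount(); i++) {
                        columns.put(md.getColumnName(i), new ArrayList<>());
                    }
                    while (rs.next()) {
                        for (int i = 1; i <= md.getColumnCount(); i++) {
                            columns.get(md.getColumnName(i)).add(rs.getString(i));
                        }
                    }
                }
            }
            return columns;
        }
    }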

Faster indexing

  • Check how to improve elasticsearch's performance
  • Build a pre-indexer that filters out data that has already been indexed for a given column. Basically this requires a count-min sketch per column, so that we can decide not to send certain data to the store if it has already been indexed (note that even though re-sending the data won't change the index, it still requires processing); see the sketch after this list.
  • throw more strategies here... (that don't involve building our own stuff, for now)
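
A sketch of the pre-indexer bullet above, assuming the stream-lib CountMinSketch (the same library family the profiler already uses for HyperLogLogPlus); the sketch parameters are illustrative:

    // Per-column pre-indexer idea: remember (approximately) which values were
    // already sent to the store and skip re-sending them. A count-min sketch
    // can only over-estimate, so occasionally a never-seen value may be
    // skipped; that is part of the accuracy/performance trade-off to evaluate.
    import com.clearspring.analytics.stream.frequency.CountMinSketch;

    public class PreIndexerExample {

        // epsilon, confidence and seed values below are illustrative
        private final CountMinSketch seen = new CountMinSketch(0.001, 0.99, 42);

        // Returns true if the value looks new and should be sent to the store.
        public boolean shouldIndex(String value) {
            boolean isNew = seen.estimateCount(value) == 0;
            if (isNew) {
                seen.add(value, 1);
            }
            return isNew;
        }
    }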

Unify keyword search

Search in content, schema name, table name, db name.
Then indicate the context of the results.

Exact keyword/schema search

When searching for schema names, sometimes it is useful to search for exact matches, rather than approximations. In certain cases, users know the exact name of the schema, and therefore it is more useful to do an exact search. Add some property to permit this.
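
One way to expose this with the Elasticsearch Java API is to switch between a term query and a match query based on the new property; the field name "schema_name" is a placeholder for the actual mapping, and an exact term query usually needs a not_analyzed (raw) variant of the field:

    // Sketch: exact (term) query versus analyzed (match) query, selected by
    // the proposed property. "schema_name" is a placeholder field name.
    import org.elasticsearch.index.query.QueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;

    public class SchemaSearchExample {

        public static QueryBuilder schemaQuery(String schemaName, boolean exact) {
            return exact
                    ? QueryBuilders.termQuery("schema_name", schemaName)   // exact match
                    : QueryBuilders.matchQuery("schema_name", schemaName); // approximate
        }
    }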

Parse input options

Parse input options, based on the options defined in ProfilerConfig, so that users can start the profiler with a set of parameters. Unit test.

Unit test DBConnector with Oracle db

Make sure the DBConnector code is compatible with as many dbs as possible. In particular test that the method:

"public List getAttributes() throws SQLException {"

in DBConnector works well.

In the short term, we need the code to work with an Oracle 10g database.

Specialized tokenizers for db data

Schemas have names like:

last_name
us_phone_number

Find a better tokenizer to cover these cases; this will boost accuracy a lot.
These new tokenizers would be part of Elasticsearch.

Wrong matching of floats with mixed commas and dots

I have created a "SanitizationTest" that contains a few class attributes with real data. In some cases those values have the form:

160,124.05 (note the first comma)

When this is part of a field in a CSV file, the format says that the value should be escaped with quotes. For example, if we have fields A, B, C, with B of the form shown previously, a line could look like:

"Boston", "160,124.05", 89

The current implementation does not parse the second field properly, which introduces errors that propagate through the entire prototype.
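
For reference, a quote-aware parser keeps the second field intact; here is a minimal check with Apache Commons CSV, used only to illustrate the expected behaviour rather than as the parser the profiler must adopt:

    // Minimal check that a quote-aware parser keeps "160,124.05" as one field.
    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    public class QuotedFieldExample {

        public static void main(String[] args) throws IOException {
            String line = "\"Boston\",\"160,124.05\",89";
            try (CSVParser parser = new CSVParser(new StringReader(line), CSVFormat.DEFAULT)) {
                for (CSVRecord record : parser) {
                    System.out.println(record.get(1)); // prints 160,124.05 as one value
                }
            }
        }
    }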

bugs in the CardinalityAnalyzer

The CardinalityAnalyzer uses HyperLogLogPlus to estimate the cardinality. However, the results are error prone. For example, I used the module to derive the cardinality of 500 records.

The results showed 531 unique elements while the total number of records is 500.

Probably we should use a better estimator.
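
A quick way to reproduce and measure this with the same library; the precision parameters passed to HyperLogLogPlus are the main knobs to tune, and the values below are illustrative:

    // Accuracy check: feed a known number of distinct values and compare the
    // estimate against the ground truth.
    import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

    public class CardinalityCheckExample {

        public static void main(String[] args) {
            HyperLogLogPlus hll = new HyperLogLogPlus(14, 25); // illustrative precision
            int trueCardinality = 500;
            for (int i = 0; i < trueCardinality; i++) {
                hll.offer("record-" + i);
            }
            System.out.println("True: " + trueCardinality
                    + ", estimated: " + hll.cardinality());
        }
    }

Comparing several precision settings this way would show whether the observed 531 vs 500 error comes from how the estimator is configured or fed, or from an inherent limit of the approximation.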

Parallelism granularity

Right now the parallelism granularity is the table, to avoid redundant reads.
By splitting the data in memory we could provide finer-grained parallelism, i.e. per column, while still avoiding redundant reads.

Improve source identification

Right now we have (source, field).
As we merge sources from different databases and repositories, it becomes useful to identify these too. So for example:
((database, source) , field)
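
A small sketch of the richer identifier as a value class (names are illustrative):

    // Hypothetical composite identifier: (database, source, field).
    import java.util.Objects;

    public final class FieldId {

        private final String database;
        private final String source;
        private final String field;

        public FieldId(String database, String source, String field) {
            this.database = database;
            this.source = source;
            this.field = field;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FieldId)) {
                return false;
            }
            FieldId other = (FieldId) o;
            return database.equals(other.database)
                    && source.equals(other.source)
                    && field.equals(other.field);
        }

        @Override
        public int hashCode() {
            return Objects.hash(database, source, field);
        }

        @Override
        public String toString() {
            return "((" + database + ", " + source + "), " + field + ")";
        }
    }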
