
datapatterns's People

Contributors

dcamper, dependabot[bot], ghalliday, gordonsmith, jbrundage


datapatterns's Issues

Self test is failing

Source Severity Code Message FileName LineNo Column id
user Error 100000 Assert (integer8 = integer3) failed [best_attribute_type = 'integer3'] /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl 88 88 0
user Error 100000 Assert (unsigned4 = unsigned2) failed [best_attribute_type = 'unsigned2'] /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl 417 417 1
user Error 100000 Assert (unsigned4 = unsigned2) failed [best_attribute_type = 'unsigned2'] /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl 438 438 2
user Error 100000 Assert (unsigned4 = unsigned2) failed [best_attribute_type = 'unsigned2'] /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl 459 459 3

Add skew analysis to output

Potential output could be the same as what is shown in the activity graphs, or it could be more detailed (node-by-node); a sketch of the latter follows.
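
For the node-by-node option, a minimal sketch of gathering per-node record counts (assuming Std.System.Thorlib.Node() is available; the dataset is hypothetical), from which skew could be derived:

    IMPORT Std;

    ds := DATASET('~some::logical::file', {STRING20 city}, THOR); // hypothetical

    // Count the records held on each Thor node; LOCAL keeps the
    // aggregation on-node so each node reports its own total
    perNode := TABLE(ds,
                     {UNSIGNED4 node := Std.System.Thorlib.Node(),
                      UNSIGNED8 recs := COUNT(GROUP)},
                     Std.System.Thorlib.Node(), LOCAL);
    OUTPUT(perNode, NAMED('RecordsPerNode'));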

Break out detailed cardinality in a separate output/dataset

Users would like to see the detailed value counts (cardinality) broken out all the way, which understandably would blow up the primary profiling dataset.

To speed up performance, could we have the ability to turn off cardinality in the Profile function, plus an additional function that lets us choose the columns for which we'd like a separate output of all the value counts (or all columns by default)? If possible, having all the columns broken out in this new output would allow users to join the field name back to the profile dataset for additional analysis. We leave it to you to determine the performance limitations of such an approach. A sketch of the turn-off half appears below.
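
For the first half of the request, a minimal sketch that leans on Profile's documented features argument to skip cardinality (the exact feature-string tokens and the named-parameter call are assumptions; the dataset is hypothetical):

    IMPORT DataPatterns;

    Layout := RECORD
        STRING20 city;
        STRING10 zip;
    END;

    ds := DATASET('~some::logical::file', Layout, THOR); // hypothetical path

    // Request only a subset of metrics; 'cardinality' is deliberately absent
    slimProfile := DataPatterns.Profile(ds, features := 'fill_rate,best_ecl_types,lengths,patterns');
    OUTPUT(slimProfile, NAMED('SlimProfile'));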

Avoid symbol name collision with calling code

If the caller defines a symbol in their code after calling Profile() that is also defined as LOCAL within Profile(), an obtuse error is presented. This is due to the name collision, but that is not immediately obvious.

This problem is not specific to Profile(). All other function macros may need adjustment as well.
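
A hypothetical illustration of the collision (the colliding symbol name here is invented for the example):

    IMPORT DataPatterns;

    ds := DATASET([{'Jane'}, {'John'}], {STRING10 name});

    profiled := DataPatterns.Profile(ds);
    OUTPUT(profiled);

    // If Profile() happened to define a LOCAL attribute also named
    // 'trimmedRecs', this later definition would collide with the
    // expanded macro code, and the compiler error would point somewhere
    // unexpected rather than here:
    trimmedRecs := ds(name <> '');
    OUTPUT(trimmedRecs);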

best_attribute_type leading zeros for float / exponential notation

The solution for #42 may have introduced an issue when suggesting the best type for fields whose values are all numbers in decimal or exponential notation beginning with a zero. For example - https://play.hpccsystems.com:18010/?Wuid=W20190722-192640&Widget=WUDetailsWidget#/stub/Resources-DL/Grid - specifically the "precipintensity" attribute.

In lieu of line 511 on db2a279#diff-a648eefa0718d8c9b3b40721e26e0563R511, would something like this regex work, if the implementation of the REGEXFIND builtin is capable of negative lookahead?

https://regex101.com/r/h7uzgD/3
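
Not the exact expression from the link, but a hypothetical version of the idea, flagging a leading zero only when it is not immediately followed by a decimal point, an exponent, or the end of the value (this works only if the regex engine behind REGEXFIND accepts negative lookahead):

    // TRUE for values like '007' (zero-padded, keep as a string); FALSE for
    // '0.123' or '0.5e-3' (legitimate decimal/exponential notation)
    HasPaddingZero(STRING s) := REGEXFIND('^-?0(?![.eE]|$)', s);

    OUTPUT(HasPaddingZero('007'));    // TRUE
    OUTPUT(HasPaddingZero('0.123'));  // FALSE
    OUTPUT(HasPaddingZero('0'));      // FALSE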

Best Record Structure + var length string

There are situations where the optimal string format will be a variable length string. I recently had a field whose profile was:

  • min string len: 2
  • mean string len: 112
  • max string len: 37262
    (approximate numbers)

In the above scenario, shouldn't the "optimized" type be a variable-length STRING?
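
Agreed that fixed width is a poor fit for that profile; sketched with a hypothetical field name:

    // Sized to the max length, so nearly every record carries ~37 KB of padding
    FixedLayout := RECORD
        STRING37262 notes;
    END;

    // A variable-length STRING stores only what each record actually needs
    VarLayout := RECORD
        STRING notes;
    END;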

Profile: Popular and rare patterns have problems with non-ASCII input

The Profile code uses the regex classes [[:upper:]] and [[:lower:]] to test characters, but not all Unicode alphabetic characters map to either class. Those characters are therefore passed along as-is. The problem is that the pattern field itself is a STRING, so HPCC coerces those as-is characters and you wind up with something different (a valid coercion, "I don't know" substitution characters, etc.).

One solution is to convert the pattern field to UTF8. Another is to find a way to map those types of Unicode characters better.
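
A small sketch of the failure mode (the sample character is arbitrary):

    // '例' is alphabetic but matches neither [[:upper:]] nor [[:lower:]],
    // so the pattern generator would pass it through untouched
    uVal := u'例x';

    // The pattern field is a STRING, so the UNICODE value gets coerced;
    // '例' has no single-byte equivalent and is substituted, not preserved
    patternAsString := (STRING)uVal;
    OUTPUT(patternAsString);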

Add function for Benford's Law test on a column

Benford’s law can be useful to detect fraud and data errors because it’s expected that certain large sets of numbers will follow the law. I think it would be a great new feature for the Data Patterns library. It would be nice to have a function that could take a column of a data set and detect if it passes the Benford Test.

Optionally, it would also be nice to make it possible to test specified columns for a Benford distribution in ECL Watch when analyzing data files. This should be optional because it is not applicable to all data columns (or even to all numeric data).
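
A minimal sketch of the column-level test (the dataset and field name are hypothetical; a production version would want a proper goodness-of-fit statistic such as chi-squared):

    Layout := RECORD
        REAL8 amount;
    END;

    ds := DATASET('~finance::transactions', Layout, THOR); // hypothetical path

    // First significant digit of a value, or 0 if there is none
    FirstDigit(REAL8 v) := (UNSIGNED1)REGEXFIND('([1-9])', (STRING)ABS(v), 1);

    nonZero := ds(FirstDigit(amount) > 0);
    total := COUNT(nonZero);

    digitCounts := TABLE(nonZero,
                         {UNSIGNED1 digit := FirstDigit(amount),
                          UNSIGNED8 cnt := COUNT(GROUP)},
                         FirstDigit(amount));

    CompareRec := RECORD
        UNSIGNED1 digit;
        REAL8 observed;
        REAL8 expected;
    END;

    // Observed share of each digit versus the Benford expectation log10(1 + 1/d)
    comparison := PROJECT(digitCounts,
                          TRANSFORM(CompareRec,
                                    SELF.observed := LEFT.cnt / total,
                                    SELF.expected := LOG(1 + 1 / LEFT.digit), // LOG is base 10 in ECL
                                    SELF := LEFT));
    OUTPUT(SORT(comparison, digit), NAMED('BenfordComparison'));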

Unable to process certain child datasets

If the input file contains an embedded child record that in turn contains a child dataset, the ECL compiler complains with a "no specified row for table" error.
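
A hypothetical layout that reproduces the failure:

    CodeRec := RECORD
        STRING10 code;
    END;

    InnerRec := RECORD
        STRING20 name;
        DATASET(CodeRec) codes;   // child dataset inside the embedded record
    END;

    Layout := RECORD
        UNSIGNED4 id;
        InnerRec info;            // embedded child record
    END;

    // Profiling a dataset with this layout currently fails to compile
    // with the "no specified row for table" error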

Add support for embedded child records

It should be possible to treat top-level fields within the embedded record as "regular" top-level fields, just reported with a slightly different name.
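
For example (hypothetical layout), the fields inside the embedded record could be reported as if they were top-level, with dotted names:

    AddressRec := RECORD
        STRING30 city;
        STRING10 zip;
    END;

    Layout := RECORD
        UNSIGNED4 id;
        AddressRec address;   // embedded child record, no child datasets
    END;

    // Profile output could then list attributes named
    // 'id', 'address.city', and 'address.zip'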

Add multi-column cardinality

Multi-column cardinality could be added as a feature, if possible.

Some use cases:

  • To identify the combination of columns that could potentially be used to uniquely identify a record
  • To know the cardinality of all keyed columns in an INDEX (see the sketch after this list)
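
A minimal sketch of computing the combined cardinality of two columns (dataset and field names hypothetical):

    Layout := RECORD
        STRING2 state;
        STRING5 zip;
    END;

    ds := DATASET('~some::logical::file', Layout, THOR); // hypothetical path

    // Deduplicate on the column combination, then count the distinct pairs
    distinctPairs := TABLE(ds, {ds.state, ds.zip}, state, zip, MERGE);
    OUTPUT(COUNT(distinctPairs), NAMED('StateZipCardinality'));

    // If this equals COUNT(ds), the (state, zip) pair uniquely identifies a record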

Profile: Perform numeric analysis on fields with a numeric best_ecl_type

Currently, only fields with a numeric datatype on the original dataset acquire numeric analytics (min, ave, max, std dev, quartiles, etc). A field could be marked as a STRING and contain only numerics, but it will not have the numeric analytics in profile.

This request is to provide numeric analytics on fields where a numeric datatype is determined to be the best type. That means that number-filled STRING fields -- which includes all fields from a just-sprayed CSV file -- would be processed as numeric.
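
Until that lands, a workaround sketch: cast the number-filled STRING field (names hypothetical) before profiling so it picks up the numeric analytics:

    IMPORT DataPatterns;

    RawRec := RECORD
        STRING amountStr;
    END;

    raw := DATASET('~ingest::just_sprayed', RawRec, CSV); // hypothetical path

    NumRec := RECORD
        REAL8 amount;
    END;

    casted := PROJECT(raw, TRANSFORM(NumRec, SELF.amount := (REAL8)LEFT.amountStr));
    OUTPUT(DataPatterns.Profile(casted), NAMED('NumericProfile'));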

Store a table of common pattern value resolutions with Similarity Percentage (confidence)

Users love the pattern detection but would like to leverage those patterns against a dataset that keeps the most common resolution for those patterns, as a potential one-to-many name/value lookup. For instance, patterns of 9999999999, 999-999-9999, and +9 9999999999 would have values in this new dataset flagging them as a potential phone number. A sample of the output could look like the layout sketched below.

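A sketch of what such a lookup dataset could look like (all names and confidence values are hypothetical placeholders):

    PatternResolution := RECORD
        STRING pattern;      // e.g. '999-999-9999'
        STRING resolution;   // e.g. 'phone number'
        REAL4 confidence;    // similarity percentage
    END;

    lookups := DATASET([{'9999999999', 'phone number', 92.5},
                        {'999-999-9999', 'phone number', 97.0},
                        {'+9 9999999999', 'phone number', 88.0}],
                       PatternResolution);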

Request: Easy method for analyzing different profiling results

Satisfy this scenario: Profiling is used to analyze new data that will be ingested. Profiling results are saved as a logical file. Then, a new batch of data arrives and is profiled. The new method should compare the new profiling results with the old and output a summary of any differences.

The end goal is to highlight significant differences between the two profiles, which could indicate a significant or unexpected change in the incoming data stream.
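
A minimal sketch of the comparison (file names are hypothetical; 'attribute' and 'fill_rate' are assumed to be among Profile's output fields, and the layout here is pared down to just those two):

    ProfSlim := RECORD
        STRING attribute;
        REAL8 fill_rate;
    END;

    oldProf := DATASET('~profiles::batch1', ProfSlim, THOR); // hypothetical
    newProf := DATASET('~profiles::batch2', ProfSlim, THOR); // hypothetical

    DiffRec := RECORD
        STRING attribute;
        REAL8 fillRateDelta;
    END;

    diffs := JOIN(oldProf, newProf,
                  LEFT.attribute = RIGHT.attribute,
                  TRANSFORM(DiffRec,
                            SELF.attribute := LEFT.attribute,
                            SELF.fillRateDelta := RIGHT.fill_rate - LEFT.fill_rate));

    // Flag shifts of more than five percentage points (threshold arbitrary)
    OUTPUT(diffs(ABS(fillRateDelta) > 5), NAMED('SignificantChanges'));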

Tag repository with version on each release

Can you add an "annotated tag" to the repository at every version?

git tag -a vX.Y.Z type of thang.

Note: The -a is important if a third party wants to include this repository as a submodule (and the versioning gives convenient commit hashes to attach to).

BestRecordStructure Suggestion

As part of the BestRecordStructure output, it would be nice if it included a second section with a suitable TRANSFORM function (especially for all the STRING -> REAL8 type conversions).

BestRecordStructure output order is incorrect

The RECORD and END are not in the correct order (END comes first):

NewLayout := RECORD
END;
//----------
NewLayout MakeNewLayout(OldLayout r) := TRANSFORM
    SELF.lon := (REAL8)r.lon;
    SELF.lat := (REAL8)r.lat;
    SELF := r;
END;
    REAL8 lon;
    REAL8 lat;
    STRING18 number;
    STRING300 street;
    STRING24 unit;
    STRING30 city;
    STRING17 district;
    STRING7 region;
    STRING10 postcode;
    STRING24 id;
    STRING16 hash;
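
The expected output would place the field declarations inside the RECORD/END block, ahead of the TRANSFORM:

NewLayout := RECORD
    REAL8 lon;
    REAL8 lat;
    STRING18 number;
    STRING300 street;
    STRING24 unit;
    STRING30 city;
    STRING17 district;
    STRING7 region;
    STRING10 postcode;
    STRING24 id;
    STRING16 hash;
END;
//----------
NewLayout MakeNewLayout(OldLayout r) := TRANSFORM
    SELF.lon := (REAL8)r.lon;
    SELF.lat := (REAL8)r.lat;
    SELF := r;
END;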

Add UTF-8 detection

For UTF8 ECL datatypes, examine the contents and determine if the string really needs to be UTF-8 or if a simple ASCII STRING will do.
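
One possible test, sketched: a UTF8 value that survives a round trip through STRING contains only characters a plain single-byte STRING can hold:

    // TRUE when every character maps cleanly into a single-byte STRING
    IsReallyAscii(UTF8 s) := ((UTF8)(STRING)s = s);

    OUTPUT(IsReallyAscii((UTF8)u'plain text'));  // TRUE
    OUTPUT(IsReallyAscii((UTF8)u'例'));          // FALSE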

Super File (Grouped) Profile

  1. Create a consolidated superfile report: stats covering the entire data in the superfile.
  2. Break down stats at the subfile level, so we can drill down to a per-subfile stats view.

Add support for child datasets

At least report the data type and the min/max/ave number of records.

Support for second-level child datasets is somewhat more problematic (understatement).
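
A sketch of gathering those record-count stats for one first-level child dataset field (layout hypothetical):

    CodeRec := RECORD
        STRING10 code;
    END;

    Layout := RECORD
        UNSIGNED4 id;
        DATASET(CodeRec) codes;   // first-level child dataset
    END;

    ds := DATASET('~some::logical::file', Layout, THOR); // hypothetical path

    // Record count of the child dataset in each parent row
    counts := PROJECT(ds, TRANSFORM({UNSIGNED4 n}, SELF.n := COUNT(LEFT.codes)));

    OUTPUT(TABLE(counts,
                 {UNSIGNED4 minRecs := MIN(GROUP, n),
                  REAL8 aveRecs := AVE(GROUP, n),
                  UNSIGNED4 maxRecs := MAX(GROUP, n)}),
           NAMED('ChildRecordStats'));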
