
datapatterns's People

Contributors

dcamper, dependabot[bot], ghalliday, gordonsmith, jbrundage


datapatterns's Issues

Self test is failing

Source Severity Code Message FileName LineNo Column id
user Error 100000 Assert (integer8 = integer3) failed [best_attribute_type = 'integer3'] /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl 88 88 0
user Error 100000 Assert (unsigned4 = unsigned2) failed [best_attribute_type = 'unsigned2'] /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl 417 417 1
user Error 100000 Assert (unsigned4 = unsigned2) failed [best_attribute_type = 'unsigned2'] /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl 438 438 2
user Error 100000 Assert (unsigned4 = unsigned2) failed [best_attribute_type = 'unsigned2'] /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl 459 459 3

Add skew analysis to output

Potential output could be the same as what is shown in the activity graphs, or it could be more detailed (node-by-node); a sketch of the latter follows.
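
For the node-by-node option, a minimal sketch of gathering per-node record counts (assuming Std.System.Thorlib.Node() is available; the dataset is hypothetical), from which skew could be derived:

    IMPORT Std;

    ds := DATASET('~some::logical::file', {STRING20 city}, THOR); // hypothetical

    // Count the records held on each Thor node; LOCAL keeps the
    // aggregation on-node so each node reports its own total
    perNode := TABLE(ds,
                     {UNSIGNED4 node := Std.System.Thorlib.Node(),
                      UNSIGNED8 recs := COUNT(GROUP)},
                     Std.System.Thorlib.Node(), LOCAL);
    OUTPUT(perNode, NAMED('RecordsPerNode'));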

Break out detailed cardinality in a separate output/dataset

Users would like to see the detailed value counts (cardinality) broken out all the way, which understandably would blow up the primary profiling dataset.

To speed up performance, could we have the ability to turn off cardinality in the Profile function, plus an additional function that lets us choose the columns for which we'd like a separate output of all the value counts (or all columns by default)? If possible, having all the columns broken out in this new output would allow users to join the field name back to the profile dataset for additional analysis. We leave it to you to determine the performance limitations of such an approach. A sketch of the turn-off half appears below.
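
For the first half of the request, a minimal sketch that leans on Profile's documented features argument to skip cardinality (the exact feature-string tokens and the named-parameter call are assumptions; the dataset is hypothetical):

    IMPORT DataPatterns;

    Layout := RECORD
        STRING20 city;
        STRING10 zip;
    END;

    ds := DATASET('~some::logical::file', Layout, THOR); // hypothetical path

    // Request only a subset of metrics; 'cardinality' is deliberately absent
    slimProfile := DataPatterns.Profile(ds, features := 'fill_rate,best_ecl_types,lengths,patterns');
    OUTPUT(slimProfile, NAMED('SlimProfile'));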

Avoid symbol name collision with calling code

If the caller defines a symbol in their code after calling Profile() that is also defined as LOCAL within Profile(), an obtuse error is presented. This is due to the name collision, but that is not immediately obvious.

This problem is not specific to Profile(). All other function macros may need adjustment as well.
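
A hypothetical illustration of the collision (the colliding symbol name here is invented for the example):

    IMPORT DataPatterns;

    ds := DATASET([{'Jane'}, {'John'}], {STRING10 name});

    profiled := DataPatterns.Profile(ds);
    OUTPUT(profiled);

    // If Profile() happened to define a LOCAL attribute also named
    // 'trimmedRecs', this later definition would collide with the
    // expanded macro code, and the compiler error would point somewhere
    // unexpected rather than here:
    trimmedRecs := ds(name <> '');
    OUTPUT(trimmedRecs);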

best_attribute_type leading zeros for float / exponential notation

The solution for #42 may have introduced an issue when suggesting the best type for fields whose values are all numbers in decimal or exponential notation beginning with a zero. For example - https://play.hpccsystems.com:18010/?Wuid=W20190722-192640&Widget=WUDetailsWidget#/stub/Resources-DL/Grid - specifically the "precipintensity" attribute.

In lieu of line 511 on db2a279#diff-a648eefa0718d8c9b3b40721e26e0563R511, would something like this regex work, if the implementation of the REGEXFIND builtin is capable of negative lookahead?

https://regex101.com/r/h7uzgD/3
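
Not the exact expression from the link, but a hypothetical version of the idea, flagging a leading zero only when it is not immediately followed by a decimal point, an exponent, or the end of the value (this works only if the regex engine behind REGEXFIND accepts negative lookahead):

    // TRUE for values like '007' (zero-padded, keep as a string); FALSE for
    // '0.123' or '0.5e-3' (legitimate decimal/exponential notation)
    HasPaddingZero(STRING s) := REGEXFIND('^-?0(?![.eE]|$)', s);

    OUTPUT(HasPaddingZero('007'));    // TRUE
    OUTPUT(HasPaddingZero('0.123'));  // FALSE
    OUTPUT(HasPaddingZero('0'));      // FALSE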

Best Record Structure + var length string

There are situations where the optimal string format will be a variable length string. I recently had a field whose profile was:

  • min string len: 2
  • mean string len: 112
  • max string len: 37262
    (approximate numbers)

In the above scenario, shouldn't the "optimized" type be a variable-length STRING?
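
Agreed that fixed width is a poor fit for that profile; sketched with a hypothetical field name:

    // Sized to the max length, so nearly every record carries ~37 KB of padding
    FixedLayout := RECORD
        STRING37262 notes;
    END;

    // A variable-length STRING stores only what each record actually needs
    VarLayout := RECORD
        STRING notes;
    END;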

Profile: Popular and rare patterns have problems with non-ASCII input

The Profile code uses the regex classes [[:upper:]] and [[:lower:]] to test characters, but not all Unicode alphabetic characters map to either class. Those characters are therefore passed along as-is. The problem is that the pattern field itself is a STRING, so HPCC coerces those as-is characters and you wind up with something different (a valid coercion, "I don't know" substitution characters, etc.).

One solution is to convert the pattern field to UTF8. Another is to find a way to map those types of Unicode characters better.
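
A small sketch of the failure mode (the sample character is arbitrary):

    // '例' is alphabetic but matches neither [[:upper:]] nor [[:lower:]],
    // so the pattern generator would pass it through untouched
    uVal := u'例x';

    // The pattern field is a STRING, so the UNICODE value gets coerced;
    // '例' has no single-byte equivalent and is substituted, not preserved
    patternAsString := (STRING)uVal;
    OUTPUT(patternAsString);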

Add function for Benford's Law test on a column

Benford’s law can be useful to detect fraud and data errors because it’s expected that certain large sets of numbers will follow the law. I think it would be a great new feature for the Data Patterns library. It would be nice to have a function that could take a column of a data set and detect if it passes the Benford Test.

Optionally, it would also be nice to make it possible to test specified columns for a Benford distribution in ECL Watch when analyzing data files. This should be optional because it is not applicable to all data columns (or even to all numeric data).
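
A minimal sketch of the column-level test (the dataset and field name are hypothetical; a production version would want a proper goodness-of-fit statistic such as chi-squared):

    Layout := RECORD
        REAL8 amount;
    END;

    ds := DATASET('~finance::transactions', Layout, THOR); // hypothetical path

    // First significant digit of a value, or 0 if there is none
    FirstDigit(REAL8 v) := (UNSIGNED1)REGEXFIND('([1-9])', (STRING)ABS(v), 1);

    nonZero := ds(FirstDigit(amount) > 0);
    total := COUNT(nonZero);

    digitCounts := TABLE(nonZero,
                         {UNSIGNED1 digit := FirstDigit(amount),
                          UNSIGNED8 cnt := COUNT(GROUP)},
                         FirstDigit(amount));

    CompareRec := RECORD
        UNSIGNED1 digit;
        REAL8 observed;
        REAL8 expected;
    END;

    // Observed share of each digit versus the Benford expectation log10(1 + 1/d)
    comparison := PROJECT(digitCounts,
                          TRANSFORM(CompareRec,
                                    SELF.observed := LEFT.cnt / total,
                                    SELF.expected := LOG(1 + 1 / LEFT.digit), // LOG is base 10 in ECL
                                    SELF := LEFT));
    OUTPUT(SORT(comparison, digit), NAMED('BenfordComparison'));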

Unable to process certain child datasets

If the input file contains an embedded child record that in turn contains a child dataset, the ECL compiler complains with a "no specified row for table" error.
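
A hypothetical layout that reproduces the failure:

    CodeRec := RECORD
        STRING10 code;
    END;

    InnerRec := RECORD
        STRING20 name;
        DATASET(CodeRec) codes;   // child dataset inside the embedded record
    END;

    Layout := RECORD
        UNSIGNED4 id;
        InnerRec info;            // embedded child record
    END;

    // Profiling a dataset with this layout currently fails to compile
    // with the "no specified row for table" error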

Add support for embedded child records

It should be possible to treat top-level fields within the embedded record as "regular" top-level fields, just reported with a slightly different name.
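
For example (hypothetical layout), the fields inside the embedded record could be reported as if they were top-level, with dotted names:

    AddressRec := RECORD
        STRING30 city;
        STRING10 zip;
    END;

    Layout := RECORD
        UNSIGNED4 id;
        AddressRec address;   // embedded child record, no child datasets
    END;

    // Profile output could then list attributes named
    // 'id', 'address.city', and 'address.zip'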

Add multi-column cardinality

Multi-column cardinality could be added as a feature, if possible.

Some use cases:

  • To identify the combination of columns that could potentially be used to uniquely identify a record
  • To know the cardinality of all keyed columns in an INDEX (see the sketch after this list)
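
A minimal sketch of computing the combined cardinality of two columns (dataset and field names hypothetical):

    Layout := RECORD
        STRING2 state;
        STRING5 zip;
    END;

    ds := DATASET('~some::logical::file', Layout, THOR); // hypothetical path

    // Deduplicate on the column combination, then count the distinct pairs
    distinctPairs := TABLE(ds, {ds.state, ds.zip}, state, zip, MERGE);
    OUTPUT(COUNT(distinctPairs), NAMED('StateZipCardinality'));

    // If this equals COUNT(ds), the (state, zip) pair uniquely identifies a record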

Profile: Perform numeric analysis on fields with a numeric best_ecl_type

Currently, only fields with a numeric datatype on the original dataset acquire numeric analytics (min, ave, max, std dev, quartiles, etc). A field could be marked as a STRING and contain only numerics, but it will not have the numeric analytics in profile.

This request is to provide numeric analytics on fields where a numeric datatype is determined to be the best type. That means that number-filled STRING fields -- which includes all fields from a just-sprayed CSV file -- would be processed as numeric.
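
Until that lands, a workaround sketch: cast the number-filled STRING field (names hypothetical) before profiling so it picks up the numeric analytics:

    IMPORT DataPatterns;

    RawRec := RECORD
        STRING amountStr;
    END;

    raw := DATASET('~ingest::just_sprayed', RawRec, CSV); // hypothetical path

    NumRec := RECORD
        REAL8 amount;
    END;

    casted := PROJECT(raw, TRANSFORM(NumRec, SELF.amount := (REAL8)LEFT.amountStr));
    OUTPUT(DataPatterns.Profile(casted), NAMED('NumericProfile'));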

Store a table of common pattern value resolutions with Similarity Percentage (confidence)

Users love the pattern detection but would like to leverage those patterns against a dataset that keeps the most common resolution for those patterns, as a potential one-to-many name/value lookup. For instance, patterns of 9999999999, 999-999-9999, and +9 9999999999 would have values in this new dataset flagging them as a potential phone number. A sample of the output could look like the layout sketched below.

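A sketch of what such a lookup dataset could look like (all names and confidence values are hypothetical placeholders):

    PatternResolution := RECORD
        STRING pattern;      // e.g. '999-999-9999'
        STRING resolution;   // e.g. 'phone number'
        REAL4 confidence;    // similarity percentage
    END;

    lookups := DATASET([{'9999999999', 'phone number', 92.5},
                        {'999-999-9999', 'phone number', 97.0},
                        {'+9 9999999999', 'phone number', 88.0}],
                       PatternResolution);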

Request: Easy method for analyzing different profiling results

Satisfy this scenario: Profiling is used to analyze new data that will be ingested. Profiling results are saved as a logical file. Then, a new batch of data arrives and is profiled. The new method should compare the new profiling results with the old and output a summary of any differences.

The end goal is to highlight significant differences between the two profiles, which could indicate a significant or unexpected change in the incoming data stream.
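
A minimal sketch of the comparison (file names are hypothetical; 'attribute' and 'fill_rate' are assumed to be among Profile's output fields, and the layout here is pared down to just those two):

    ProfSlim := RECORD
        STRING attribute;
        REAL8 fill_rate;
    END;

    oldProf := DATASET('~profiles::batch1', ProfSlim, THOR); // hypothetical
    newProf := DATASET('~profiles::batch2', ProfSlim, THOR); // hypothetical

    DiffRec := RECORD
        STRING attribute;
        REAL8 fillRateDelta;
    END;

    diffs := JOIN(oldProf, newProf,
                  LEFT.attribute = RIGHT.attribute,
                  TRANSFORM(DiffRec,
                            SELF.attribute := LEFT.attribute,
                            SELF.fillRateDelta := RIGHT.fill_rate - LEFT.fill_rate));

    // Flag shifts of more than five percentage points (threshold arbitrary)
    OUTPUT(diffs(ABS(fillRateDelta) > 5), NAMED('SignificantChanges'));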

Tag repository with version on each release

Can you add an "annotated tag" to the repository at every version?

git tag -a vX.Y.Z type of thang.

Note: The -a is important if a third party wants to include this repository as a submodule (and the versioning gives convenient commit hashes to attach to).

BestRecordStructure Suggestion

As part of the BestRecordStructure output, it would be nice if it included a second section with a suitable TRANSFORM function (especially for all the STRING -> REAL8 type conversions).

BestRecordStructure output order is incorrect

The RECORD and END are not in the correct order (END comes first):

NewLayout := RECORD
END;
//----------
NewLayout MakeNewLayout(OldLayout r) := TRANSFORM
    SELF.lon := (REAL8)r.lon;
    SELF.lat := (REAL8)r.lat;
    SELF := r;
END;
    REAL8 lon;
    REAL8 lat;
    STRING18 number;
    STRING300 street;
    STRING24 unit;
    STRING30 city;
    STRING17 district;
    STRING7 region;
    STRING10 postcode;
    STRING24 id;
    STRING16 hash;
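
The expected output would place the field declarations inside the RECORD/END block, ahead of the TRANSFORM:

NewLayout := RECORD
    REAL8 lon;
    REAL8 lat;
    STRING18 number;
    STRING300 street;
    STRING24 unit;
    STRING30 city;
    STRING17 district;
    STRING7 region;
    STRING10 postcode;
    STRING24 id;
    STRING16 hash;
END;
//----------
NewLayout MakeNewLayout(OldLayout r) := TRANSFORM
    SELF.lon := (REAL8)r.lon;
    SELF.lat := (REAL8)r.lat;
    SELF := r;
END;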

Add UTF-8 detection

For UTF8 ECL datatypes, examine the contents and determine if the string really needs to be UTF-8 or if a simple ASCII STRING will do.
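
One possible test, sketched: a UTF8 value that survives a round trip through STRING contains only characters a plain single-byte STRING can hold:

    // TRUE when every character maps cleanly into a single-byte STRING
    IsReallyAscii(UTF8 s) := ((UTF8)(STRING)s = s);

    OUTPUT(IsReallyAscii((UTF8)u'plain text'));  // TRUE
    OUTPUT(IsReallyAscii((UTF8)u'例'));          // FALSE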

Super File (Grouped) Profile

  1. Create a consolidated superfile report: stats covering the entire data in the superfile.
  2. Break down stats at the subfile level, so we can drill down to a per-subfile stats view.

Add support for child datasets

At least report the data type and the min/max/ave number of records.

Support for second-level child datasets is somewhat more problematic (understatement).
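
A sketch of gathering those record-count stats for one first-level child dataset field (layout hypothetical):

    CodeRec := RECORD
        STRING10 code;
    END;

    Layout := RECORD
        UNSIGNED4 id;
        DATASET(CodeRec) codes;   // first-level child dataset
    END;

    ds := DATASET('~some::logical::file', Layout, THOR); // hypothetical path

    // Record count of the child dataset in each parent row
    counts := PROJECT(ds, TRANSFORM({UNSIGNED4 n}, SELF.n := COUNT(LEFT.codes)));

    OUTPUT(TABLE(counts,
                 {UNSIGNED4 minRecs := MIN(GROUP, n),
                  REAL8 aveRecs := AVE(GROUP, n),
                  UNSIGNED4 maxRecs := MAX(GROUP, n)}),
           NAMED('ChildRecordStats'));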
