hpcc-systems / datapatterns

HPCC Systems ECL bundle that provides some basic data profiling and research tools to an ECL programmer.
Source | Severity | Code | Message | FileName | LineNo | Column | id |
---|---|---|---|---|---|---|---|
user | Error | 100000 | Assert (integer8 = integer3) failed [best_attribute_type = 'integer3'] | /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl | 88 | 88 | 0 |
user | Error | 100000 | Assert (unsigned4 = unsigned2) failed [best_attribute_type = 'unsigned2'] | /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl | 417 | 417 | 1 |
user | Error | 100000 | Assert (unsigned4 = unsigned2) failed [best_attribute_type = 'unsigned2'] | /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl | 438 | 438 | 2 |
user | Error | 100000 | Assert (unsigned4 = unsigned2) failed [best_attribute_type = 'unsigned2'] | /home/gordon/git/HPCC-Platform/ecllibrary/teststd/DataPatterns/TestDataPatterns.ecl | 459 | 459 | 3 |
Potential output could be the same as what is shown in activity graphs, or could be more detailed (node-by-node).
You can probably argue this both ways, but if a string contains a numeric value with leading zeros, it should probably be treated as a string?
Examples:
At minimum, report given data type and the min/max/ave number of elements.
Users would like to see the detailed value counts (cardinality) broken out all the way, which understandably would blow up the primary profiling dataset.
To speed up performance, can we have the ability to turn off cardinality in the Profile function, plus an additional function that lets us choose which columns to produce a separate output of all the value counts for (or all columns by default)? If possible, having all the columns broken out in this new output would allow users to join the field name back to the profile dataset for additional analysis. We leave it to you to determine the performance limitations of such an approach.
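The proposed breakout could look something like this Python sketch (illustrative only; the bundle itself would express this as ECL TABLE/COUNT aggregation, and the function name is hypothetical). Each distinct value becomes one (column, value, count) row, so the result can be joined back to the profile output on the column name:

```python
from collections import Counter

def value_counts(rows, columns):
    """Standalone value-count breakout: one (column, value, count) tuple
    per distinct value, most frequent first, for each requested column.
    """
    out = []
    for col in columns:
        counts = Counter(row[col] for row in rows)
        out.extend((col, value, count) for value, count in counts.most_common())
    return out
```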
If the caller defines a symbol in their code after calling Profile() that is also defined as LOCAL within Profile(), an obtuse error will be presented. This is due to a name collision, but that is not immediately obvious.
This problem is not specific to Profile(). All other function macros may need adjustment as well.
Embed the version of DataPatterns that created the output as part of the output (will be useful for backward compatibility once integrated into ECL Watch).
The solution for #42 might have caused an issue when suggesting the best type for fields containing only values that are numbers in decimal or exponential notation and begin with zero. For example - https://play.hpccsystems.com:18010/?Wuid=W20190722-192640&Widget=WUDetailsWidget#/stub/Resources-DL/Grid - specifically the "precipintensity" attribute.
In lieu of line 511 on db2a279#diff-a648eefa0718d8c9b3b40721e26e0563R511, would something like this regex work, if the implementation of the REGEXFIND builtin is capable of negative lookahead?
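Python's re module accepts the same negative-lookahead syntax as PCRE, so the idea can be checked there first; whether REGEXFIND's engine accepts it is exactly the open question above. The pattern below is illustrative, not the actual DataPatterns regex:

```python
import re

# Match plain decimal or exponential numerics, but reject a leading zero
# followed by more digits ('0123'), while still accepting '0' and '0.5'.
NUMERIC = re.compile(r"^(?!0\d)\d+(\.\d+)?([eE][-+]?\d+)?$")
```

If REGEXFIND's underlying engine supports lookahead, the same pattern (with doubled backslashes in the ECL string literal) should carry over.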
From the automated vulnerability finder:
@GordonSmith FYI
@jbrundage Can you update and test? Target would be candidate-1.5.6. Thanks!
There are situations where the optimal string format will be a variable length string. I recently had a field whose profile was:
In the above scenario, I would expect the "optimized" string to be a variable-length string?
The Profile code uses regex's [[:upper:]] and [[:lower:]] to test characters, but not all Unicode alphabetic characters map to either. Those characters are therefore passed along as-is. The problem is, the pattern field itself is a STRING, so HPCC performs a coercion of those as-is characters and you wind up with something different (valid coercion, I-Don't-Know characters, etc).
One solution is to convert the pattern field to UTF8. Another is to find a way to map those types of Unicode characters better.
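One way the better mapping could look, sketched in Python (DataPatterns itself is ECL; the A/a/9 placeholders mirror its text-pattern symbols, and the function name is hypothetical). Classifying by Unicode general category instead of POSIX [[:upper:]]/[[:lower:]] means caseless letters (category "Lo", e.g. CJK ideographs) still map to a letter placeholder rather than being passed through and coerced:

```python
import unicodedata

def pattern_char(ch: str) -> str:
    """Map one character to a profile-pattern symbol (A/a/9)."""
    cat = unicodedata.category(ch)
    if cat in ("Lu", "Lt"):
        return "A"          # uppercase / titlecase letters
    if cat.startswith("L"):
        return "a"          # Ll, plus caseless Lm/Lo letters
    if cat == "Nd":
        return "9"          # decimal digits
    return ch               # punctuation etc. passes through
```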
For fields where cardinality <=64 values, show those values along with record counts.
Benford’s law can be useful to detect fraud and data errors because it’s expected that certain large sets of numbers will follow the law. I think it would be a great new feature for the Data Patterns library. It would be nice to have a function that could take a column of a data set and detect if it passes the Benford Test.
Optionally, it would also be nice to make it possible to test specified columns for a Benford distribution in ECL Watch when analyzing data files. This should be optional because it is not applicable to all data columns (or even all numeric data).
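The proposed test could be sketched as follows (Python for illustration; the bundle would implement this in ECL, and the function names and threshold are assumptions, not part of DataPatterns). It compares observed leading-digit frequencies against Benford's expected log distribution with a chi-squared statistic:

```python
import math
from collections import Counter

# Expected leading-digit frequencies under Benford's law; they sum to 1.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(v) -> int:
    # Scientific notation puts the first significant digit first:
    # 0.0042 -> '4.200000e-03' -> 4.
    return int(f"{abs(v):e}"[0])

def benford_chi2(values) -> float:
    """Chi-squared statistic of observed leading digits vs. Benford's law.

    The caller chooses the pass/fail threshold (e.g. 15.51 for 8 degrees
    of freedom at p = 0.05); zeros carry no leading digit and are skipped.
    """
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())
```

Powers of 2 are a classic Benford-conforming set, while a column of identical values fails badly.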
If the input file contains an embedded child record that in turn contains a child dataset, the ECL compiler complains with a "no specified row for table" error.
Example:
Layout := {STRING loop};
It should be possible to treat top-level fields within the embedded record as "regular" top-level fields, just reported with a slightly different name.
Multi-column cardinality could be added as a feature, if possible.
Some use cases:
Currently, only fields with a numeric datatype on the original dataset acquire numeric analytics (min, ave, max, std dev, quartiles, etc). A field could be marked as a STRING and contain only numerics, but it will not have the numeric analytics in profile.
This is a request to provide numeric analytics on fields where a numeric datatype is determined to be the best type. That means that number-filled STRING fields -- including all fields from a just-sprayed CSV file -- will be processed as numeric.
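The requested behavior amounts to coercing first, then computing, as in this Python sketch (the bundle would express this in ECL, and the metric names here are illustrative rather than the exact DataPatterns output fields):

```python
import statistics

def numeric_stats(values):
    """Numeric analytics for a STRING field whose best type is numeric."""
    nums = [float(v) for v in values]  # coerce the string values first
    return {
        "min": min(nums),
        "max": max(nums),
        "ave": statistics.mean(nums),
        "std_dev": statistics.pstdev(nums),
    }
```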
Users love the pattern detection, but would like to leverage those patterns against a dataset that keeps the most common resolution of those patterns as a potential one-to-many name/value lookup. For instance, the patterns 9999999999, 999-999-9999, and +9 9999999999 would have entries in this new dataset flagging them as a potential phone number. A sample of the output could look like the attached image.
Satisfy this scenario: Profiling is used to analyze new data that will be ingested. Profiling results are saved as a logical file. Then, a new batch of data arrives and is profiled. The new method should compare the new profiling results with the old and output a summary of any differences.
The end goal is to highlight significant differences between the two profiles, which could indicate a significant or unexpected change in the incoming data stream.
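The comparison step could be sketched like this (Python for illustration; the real method would be ECL, and the metric names and tolerance are hypothetical, not the actual DataPatterns profile schema). Attributes are matched by name, and numeric metrics are flagged when they drift beyond a relative tolerance:

```python
def diff_profiles(old, new, rel_tol=0.10):
    """Summarize differences between two profile outputs keyed by attribute."""
    changes = []
    for name in sorted(set(old) | set(new)):
        if name not in new:
            changes.append((name, "dropped"))
            continue
        if name not in old:
            changes.append((name, "added"))
            continue
        for metric, was in old[name].items():
            now = new[name].get(metric)
            if isinstance(was, (int, float)) and was:
                # Numeric metric: flag relative drift beyond the tolerance.
                if now is None or abs(now - was) / abs(was) > rel_tol:
                    changes.append((name, f"{metric}: {was} -> {now}"))
            elif now != was:
                # Non-numeric metric (e.g. best type): flag any change.
                changes.append((name, f"{metric}: {was} -> {now}"))
    return changes
```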
Can you add an "annotated tag" (`git tag -a vX.Y.Z`) to the repository at every version?
Note: The -a is important if a third party wants to include this repository as a submodule (and the versioning gives convenient commit hashes to attach to).
As part of the BestRecordStructure output, it would be nice if it included a second section with a suitable transform function (especially for all the STRING -> REAL8 type conversions).
If you run an ECL job where there are multiple data profiling results, no result graphs show up in the Resources tab (though there may be a lot of hidden content, based on how the scroll bar looks).
The RECORD and END are not emitted in the correct order: END appears immediately after RECORD, and the field declarations come after the TRANSFORM:
NewLayout := RECORD
END;
//----------
NewLayout MakeNewLayout(OldLayout r) := TRANSFORM
SELF.lon := (REAL8)r.lon;
SELF.lat := (REAL8)r.lat;
SELF := r;
END;
REAL8 lon;
REAL8 lat;
STRING18 number;
STRING300 street;
STRING24 unit;
STRING30 city;
STRING17 district;
STRING7 region;
STRING10 postcode;
STRING24 id;
STRING16 hash;
For UTF8 ECL datatypes, examine the contents and determine if the string really needs to be UTF-8 or if a simple ASCII STRING will do.
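The per-value check is simple, as in this Python sketch (the function name is hypothetical; the real analysis would be ECL, and would demote UTF8 to STRING only when every row in the field passes):

```python
def best_string_type(value: str) -> str:
    """Suggest STRING when a nominally-UTF8 value is pure ASCII."""
    return "STRING" if value.isascii() else "UTF8"
```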
The order in which the fields appear in the rebuilt RECORD structure does not match the input.
At least report the data type and the min/max/ave number of records.
Support for second-level child datasets is somewhat more problematic (an understatement).