A universal data processor
The idea behind this library component (.NET Standard 2.0) is that you can hand it any piece of data and then explore that data. I find that when dealing with a new piece of packaged information, be it a file format, a database or a zip file with stuff in it, the first step is very often unpacking it into some kind of data structure. The unpacking work is unique to each piece of information, but the process followed is very similar in every case.
It usually begins with manually exploring the data, followed by writing some scripts or code to either ingest the data directly or transform it into something that can be ingested. Sometimes it needs to be ingested and processed in an optimal way; other times that is not important.
So the aim with this library is to convert data (whatever its source) into a self-describing format that can be explored by software. This opens up a number of analytics and downstream possibilities.
There are two fundamental assumptions in making this library.
Firstly, the self-describing data is not about data types but about real-world "human" concepts. If you only want data types, .NET already has an excellent reflection system that covers everything you need. Obviously there is some overlap, but in general the self-describing approach tries to look at the data the way a human would.
Secondly, this library is not about performance, it's about understanding. Once data is understood, I can well imagine a downstream component using that understanding to generate optimised code for bulk importing data from one format into another, but that is not this library. That isn't to say the code will never be optimised, just that it's not a big focus, and in case of conflict the library will always favor clarity over performance.
Currently, two aspects of the universal processor are outlined:
- Self-describing objects: these classes describe real-world concepts.
- Describers: these classes take incoming data and convert it into self-describing objects.
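To make the split between self-describing objects and describers a little more concrete, here is a minimal sketch of how the two might fit together. Every name in it (`Described`, `IDescriber`, `IntegerDescriber`, `UrlDescriber`, `Describe.From`) is a hypothetical placeholder made up for illustration and not the library's actual API; the concept labels and the "fall back to text" rule are assumptions too.

```csharp
using System;
using System.Globalization;

// Hypothetical sketch: none of these type names come from the library itself.

// A self-describing object: a real-world concept, the parsed value,
// and the raw input it was derived from.
public sealed class Described
{
    public Described(string concept, object value, string source)
    {
        Concept = concept;
        Value = value;
        Source = source;
    }

    public string Concept { get; }  // assumed labels such as "integer", "url", "text"
    public object Value { get; }    // the parsed value
    public string Source { get; }   // the original input, kept for traceability

    public override string ToString() => $"{Concept}: {Value}";
}

// A describer tries to recognise one concept in the incoming data.
public interface IDescriber
{
    bool TryDescribe(string input, out Described result);
}

public sealed class IntegerDescriber : IDescriber
{
    public bool TryDescribe(string input, out Described result)
    {
        if (long.TryParse(input, NumberStyles.Integer, CultureInfo.InvariantCulture, out long number))
        {
            result = new Described("integer", number, input);
            return true;
        }
        result = null;
        return false;
    }
}

public sealed class UrlDescriber : IDescriber
{
    public bool TryDescribe(string input, out Described result)
    {
        // Only absolute http/https addresses count as a URL in this sketch.
        if (Uri.TryCreate(input, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
        {
            result = new Described("url", uri, input);
            return true;
        }
        result = null;
        return false;
    }
}

public static class Describe
{
    private static readonly IDescriber[] Describers =
    {
        new IntegerDescriber(),
        new UrlDescriber()
    };

    // Asks each describer in turn; anything unrecognised is described as plain text.
    public static Described From(string input)
    {
        foreach (IDescriber describer in Describers)
        {
            if (describer.TryDescribe(input, out Described result))
            {
                return result;
            }
        }
        return new Described("text", input, input);
    }
}
```

With this sketch, `Describe.From("42")` yields a `Described` whose `Concept` is "integer", `Describe.From("https://example.com/data.csv")` yields "url", and any other string falls back to "text". Trying describers in a fixed order is the simplest thing that could work; it favors clarity over performance, in keeping with the second assumption above.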
Still to be designed are:
- ReferenceHandlers: these classes handle expanding references such as filenames or URLs into actual content.
- Analysers: an analyser provides additional insight into a given piece of data.
Obviously this is subject to change, but these are the broad goals:
- Get the basic structure in place for self-describing objects. Be able to describe: a filename, a URL, a word (token), a string of text, an integer, a .NET object.
- Structure for reference handlers. Reference handlers for: binary files, text files, web content. Be able to describe: a floating point number, a fraction, a percentage, a scientific number.
- Structure for nested content. Be able to describe: a list of things, a line of text, a money amount, times, dates. Testing goal: describe a basic text file.
- Be able to describe: a table of data, a name-value pair, categorised data, positioned (indented) text. Testing goal: describe an ini file.
- Be able to describe: time spans, data ranges, number spans, well-known mathematical constants. Structure for analytics. Testing goal: describe a csv file and a tsv file.
- Analytics: support for data mapping with nominal, ordered nominal and ordinal references. Analytics: support for defined interval scales.
- Be able to describe: hierarchical data. Testing goal: describe an xml file without a schema. Analytics: count, mean, std, min, quartiles, max for numeric data types.
- Support for tar, zip, gz, slob content extraction. Testing goal: describe a zip file with data in it.
- Analytics: references between datasets (relationships). Support for SQLite databases.
- Testing goal: describe the top 10 datasets available on kaggle.com.
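As an illustration of the numeric analytics goal (count, mean, std, min, quartiles, max), here is a rough sketch of what such a summary could compute. The `NumericSummary` type and its members are hypothetical, not part of the library. Two conventions are assumed because the goal does not specify them: the standard deviation is the sample standard deviation (dividing by n - 1), and quartiles use linear interpolation between the two nearest ranks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: NumericSummary is not a type from the library.
public sealed class NumericSummary
{
    private NumericSummary() { }

    public int Count { get; private set; }
    public double Mean { get; private set; }
    public double Std { get; private set; }     // sample standard deviation (n - 1), an assumption
    public double Min { get; private set; }
    public double Q1 { get; private set; }
    public double Median { get; private set; }
    public double Q3 { get; private set; }
    public double Max { get; private set; }

    public static NumericSummary Of(IEnumerable<double> values)
    {
        double[] sorted = values.OrderBy(v => v).ToArray();
        if (sorted.Length == 0)
        {
            throw new ArgumentException("At least one value is required.", nameof(values));
        }

        double mean = sorted.Average();
        double sumOfSquares = sorted.Sum(v => (v - mean) * (v - mean));

        return new NumericSummary
        {
            Count = sorted.Length,
            Mean = mean,
            // A single value has no spread; report 0 rather than dividing by zero.
            Std = sorted.Length > 1 ? Math.Sqrt(sumOfSquares / (sorted.Length - 1)) : 0.0,
            Min = sorted[0],
            Q1 = Quantile(sorted, 0.25),
            Median = Quantile(sorted, 0.50),
            Q3 = Quantile(sorted, 0.75),
            Max = sorted[sorted.Length - 1]
        };
    }

    // Linear interpolation between the two nearest ranks of an already sorted array.
    private static double Quantile(double[] sorted, double q)
    {
        double position = q * (sorted.Length - 1);
        int lower = (int)Math.Floor(position);
        int upper = (int)Math.Ceiling(position);
        return sorted[lower] + (sorted[upper] - sorted[lower]) * (position - lower);
    }
}
```

For the values 1, 2, 3, 4, 5 this gives a count of 5, a mean of 3, a minimum of 1, quartiles of 2, 3 and 4, and a maximum of 5. Sorting a full copy of the data is not the fastest approach, but it keeps the calculation easy to follow.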