
Comments (10)

stites commented on May 24, 2024

Leveraging the parsec format should make this pretty simple. @ocramz you may also want to monospace those @ signs : )


arvindd commented on May 24, 2024

Hello @ocramz, I have nearly finished this; here is what I have so far:

https://github.com/arvindd/dh-core/blob/arvindd-arff-38/datasets/src/Numeric/Datasets/ArffParser.hs

This file is under the datasets folder:

https://github.com/arvindd/dh-core/tree/arvindd-arff-38

Basically, I added a module Numeric.Datasets.ArffParser to the datasets package, and also added some ARFF files under datafiles. The module exposes one function, parseArff, which returns a tuple (relation name, attributes, ARFF records). The ARFF records are returned as [Maybe Dynamic], with Nothing for missing values (i.e., where the original ARFF contains a '?') and Just <dynamic value> otherwise. I hope you had the same idea :-)
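
In rough sketch form, the interface is something like the following; the names and exact types here are simplifications, and ArffParser.hs has the real definitions:

```haskell
import Data.Dynamic (Dynamic)
import qualified Data.Text as T

-- The declared type and name of one @attribute (simplified).
data AttrType = ANumeric | ANominal [T.Text] | AString
data Attribute = Attribute { attrName :: T.Text, attrType :: AttrType }

-- One @data row: Nothing for a missing value ('?'), Just d for a value
-- whose concrete type was decided by the corresponding @attribute.
type ArffRecord = [Maybe Dynamic]

-- Parse a whole ARFF document into (relation name, attributes, records).
parseArff :: T.Text -> Either String (T.Text, [Attribute], [ArffRecord])
parseArff = undefined  -- placeholder; ArffParser.hs holds the real parser
```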

As a proof of usage of this parser, I plan to add more datasets utilising the ARFF files that I have also added in the datafiles folder. For this I will need to expose more types from the ArffParser - for example, Datatype, Attribute, etc.

While I am working on this, could you (or anybody else) do a cursory review of the code above and give me any feedback that I can take into account in the next commits?

PS: Please assign this feature / issue to me, and I shall send a pull-request after I finish.


ocramz commented on May 24, 2024

Thank you @arvindd! The first thing that stuck out for me is your use of Data.Dynamic; what's the motivation there?


arvindd commented on May 24, 2024

The reason I chose Dynamic is that the parser does not know the types of the fields in a record beforehand. It only learns them once it has parsed the @attribute section of the ARFF file.

Once the type is known from the @attribute section, the function fieldval can parse the field at the right type and put it into a Dynamic, so that we end up with a list of Dynamics (in our case, Maybe Dynamic, to take care of missing values).
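
Schematically the idea is something like the sketch below (assuming an attoparsec-style parser; the real fieldval in ArffParser.hs differs in detail):

```haskell
import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import Data.Dynamic (Dynamic, toDyn)
import qualified Data.Text as T

-- The declared type of one @attribute (simplified).
data AttrType = ANumeric | ANominal [T.Text] | AString

-- Choose the field parser from the declared attribute type, then erase the
-- concrete result type with toDyn; a '?' field becomes Nothing.
fieldval :: AttrType -> Parser (Maybe Dynamic)
fieldval ty = (Nothing <$ char '?') <|> (Just <$> typed ty)
  where
    typed ANumeric     = toDyn <$> double                     -- Double
    typed (ANominal _) = toDyn <$> takeTill (inClass ",\r\n")  -- Text
    typed AString      = toDyn <$> takeTill (inClass ",\r\n")  -- Text
```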

An alternative would have been to parse the data fields in each record simply as text, and then provide two "conversion" classes (such as FromArffRecord and ToArffRecord) so that users of these records could make their own data types instances of those classes. This would possibly have been easier than the way I chose; a sketch of what I mean follows.
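
Something along these lines (the class name is the one I mention above, the row type is a made-up example):

```haskell
import qualified Data.Text as T
import qualified Data.Text.Read as TR

-- Records come out of the parser as raw text fields; users convert them.
class FromArffRecord a where
  fromArffRecord :: [Maybe T.Text] -> Either String a

-- A hypothetical two-column row type, just to show an instance.
data Sample = Sample { xVal :: Double, label :: T.Text }

instance FromArffRecord Sample where
  fromArffRecord [Just x, Just l] = Sample <$> (fst <$> TR.double x) <*> pure l
  fromArffRecord _                = Left "unexpected row shape or missing value"
```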

I chose the Dynamic way so that readDataset / safeReadDataset could later extract the list directly into a vector. Also, using Dynamic builds type-checking into the parser, since we already know the "expected" types of the fields in the record (as given in the @attribute section). The FromArffRecord / ToArffRecord approach would not use the information available in the @attribute section.

Do you have other ideas here?


ocramz commented on May 24, 2024

@arvindd thank you for your explanation :) indeed, as you say, Dynamic or something similar is inescapable. An alternative would be a sum type over the possible field types (e.g. data Ty = TyInt Int | TyDouble Double | TyText Text etc.), but this approach would still need parsers to be supplied (as I'm doing here, basically: https://github.com/ocramz/heidi/blob/master/src/Data/Generics/Encode/Internal.hs#L122 ).
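
In sketch form (a reduced version of the sum type above, with hand-supplied attoparsec parsers; constructor names are illustrative):

```haskell
import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import qualified Data.Text as T

-- A closed set of field types instead of Dynamic.
data Ty = TyDouble Double | TyText T.Text
  deriving Show

-- Parsers still have to be supplied per constructor, just as with Dynamic.
tyP :: Parser Ty
tyP = (TyDouble <$> double) <|> (TyText <$> takeTill (inClass ",\r\n"))
```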

You are essentially building yet another take on Haskell dataframes :)

The thing is that datasets instead assumes fixed record types, so I guess I should just rename this ticket, and your work could go into dh-core proper.


arvindd commented on May 24, 2024

@ocramz your assumption that datasets have a fixed record type still holds. Users of the ArffParser will in any case model a known ARFF file with this parser and extract the fields into their own specific record types - so, essentially, this still behaves as a fixed-type dataset.

Your idea of using a sum type for this problem is very nice. The immediate benefit I see is that we could parse even non-primitive types directly from the records, since you supply the parsers yourself. With Dynamic, we still need another layer of "parsers" that turn the Dynamics into the record types we need, as I mention above. That is essentially what is done with the CSV parser today: the currently embedded iris (and similar) datasets are CSV files, which provide fields as ByteStrings, and Iris has fromField instances for its record fields to obtain the right field types.
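
For reference, the CSV route looks roughly like this in cassava's FromField / FromRecord style; this is a trimmed sketch, and the actual Iris.hs in datasets differs (field names, column indices and the full record are assumptions here):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (empty)
import Data.Csv (FromField (..), FromRecord (..), (.!))

data IrisClass = Setosa | Versicolor | Virginica

instance FromField IrisClass where
  parseField "Iris-setosa"     = pure Setosa
  parseField "Iris-versicolor" = pure Versicolor
  parseField "Iris-virginica"  = pure Virginica
  parseField _                 = empty  -- reject unknown class labels

-- Trimmed to two fields for brevity.
data Iris = Iris { sepalLength :: Double, irisClass :: IrisClass }

instance FromRecord Iris where
  parseRecord v = Iris <$> v .! 0 <*> v .! 4
```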

Using Dynamic does not remove that layer. We still need the extra parsers to extract values from Dynamic into the right record fields (as is currently done in Iris.hs). What Dynamic does buy us is that we type-check within the parser (because we know the field types from the @attribute section), and that the value is tucked into a Dynamic (instead of a ByteString), so that it can simply be extracted with fromDyn at the expected type in a dataset model file such as Iris.hs.
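
That extraction layer is small, though; something like the following (a hypothetical record and column positions, not the actual layout in the dataset model files):

```haskell
import Data.Dynamic (Dynamic, fromDynamic)
import qualified Data.Text as T

-- A made-up record; field positions below are assumptions for illustration.
data Diabetes = Diabetes { plasmaGlucose :: Double, outcome :: T.Text }

-- Assumes the row has enough columns; Nothing on a missing value or a
-- Dynamic holding an unexpected type.
fromRow :: [Maybe Dynamic] -> Maybe Diabetes
fromRow row = do
  g <- fromDynamic =<< (row !! 1)   -- expects a Double inside the Dynamic
  c <- fromDynamic =<< (row !! 8)   -- expects a Text inside the Dynamic
  pure (Diabetes g c)
```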

The current implementation of ArffParser also does not encode sample classes in any special form - it simply passes them through as ByteStrings, but it does check that the class is valid using the information from @attribute. So, for example, an Iris-setosa value will just be parsed as a string; it is only checked that such a class exists (using the @attribute information). The best way to encode this information too would be dependent types :-) or singletons, which are currently the only way to do this in Haskell. I just did not want to make this more complicated than it already is :-)
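
The validity check itself is just a membership test against the declared nominal values; roughly (function name is illustrative, the parser works on ByteString rather than Text):

```haskell
import qualified Data.Text as T

-- Accept a nominal field only if it appears among the values declared in
-- the corresponding @attribute line.
validNominal :: [T.Text] -> T.Text -> Either String T.Text
validNominal allowed v
  | v `elem` allowed = Right v
  | otherwise        = Left ("unknown nominal value: " <> T.unpack v)
```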


arvindd commented on May 24, 2024

Hello @ocramz, I have now committed the final code for this, including an example dataset (Diabetes.hs) that uses the new ARFF parser. Kindly review it and suggest improvements or give feedback.

Also, please let me know if there is anything I should do before raising a pull request for it.

Here are the main files that were added:

There are modifications in these files:


ocramz commented on May 24, 2024

Hi @arvindd, thank you for this! I also really appreciate that you provided a usage example, which is clean and easy to follow (and extend, too!).

It'd be most convenient for me (and/or any other reviewers) if we could discuss the changes on the core modules over a pull request.


arvindd commented on May 24, 2024

Thanks @ocramz, I have raised PR #65 for this issue. I requested reviews from both you and @stites, as suggested by the git blame on the files changed.

All checks passed after the build, and I see a message that I could merge even without a review. I also see a message that the merge can be done automatically, as there are no merge conflicts.

Since I am not sure what the process at DataHaskell is regarding PRs, I have not merged yet (this is my first PR here :-)).

Kindly let me know whether I should merge the PR now or wait for your reviews.


ocramz commented on May 24, 2024

@arvindd Thank you for your work, this can be closed now!

