Comments (10)
Leveraging parsec for the format should make this pretty simple. @ocramz you also may want to monospace those `@` signs : )
from dh-core.
Hello @ocramz, I have almost done this so far:
https://github.com/arvindd/dh-core/blob/arvindd-arff-38/datasets/src/Numeric/Datasets/ArffParser.hs
This file is under the datasets folder:
https://github.com/arvindd/dh-core/tree/arvindd-arff-38
Basically, I added a module `Numeric.Datasets.ArffParser` in the datasets package, and also added some ARFF files in the datafiles folder. The module exposes one function, `parseArff`, which returns a tuple (relation name, attributes, ARFF records). The ARFF records are returned as `[Maybe Dynamic]`, with `Nothing` for missing values (i.e., where the original ARFF contains a `?`) and `Just <dynamic value>` for present values. I hope you had the same idea :-)
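To illustrate the missing-value convention just described, here is a minimal sketch (the `fieldVal` helper here is hypothetical, not the actual module code): a `?` token maps to `Nothing`, anything else is wrapped in a `Dynamic`.

```haskell
import Data.Dynamic (Dynamic, toDyn, fromDynamic)

-- Hypothetical stand-in for the parser's per-field handling:
-- ARFF marks missing values with '?', which becomes Nothing.
fieldVal :: String -> Maybe Dynamic
fieldVal "?" = Nothing
fieldVal s   = Just (toDyn s)

main :: IO ()
main = do
  print (fieldVal "?")                                        -- Nothing
  print (fieldVal "Iris-setosa" >>= fromDynamic :: Maybe String)
```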
As a proof of usage of this parser, I plan to add more datasets utilising the ARFF files that I have also added in the datafiles folder. I will need to expose more types from the ArffParser for this; for example, I will have to expose `Datatype`, `Attribute`, etc.
While I am working on this, can you (or anybody else) do a cursory review of the code above and give me any feedback that I can incorporate in the next commits?
PS: Please assign this feature / issue to me, and I shall send a pull-request after I finish.
Thank you @arvindd ! The first thing that stuck out for me is your use of `Data.Dynamic`; what's the motivation there?
The reason I chose `Dynamic` is that our parser cannot know the types of the fields in a record beforehand; it only learns them after parsing the `@attribute` section of the ARFF file. Once the type is known from the `@attribute` section, the function `fieldval` can parse the field at the right type and put it into a `Dynamic` element, so that we end up with a list of `Dynamic`s (in our case, `Maybe Dynamic`, to take care of missing values).
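A minimal sketch of that idea, with simplified hypothetical names (the real module's attribute type covers more ARFF types than these two): the type learned from the `@attribute` section decides what `fieldVal` parses the raw token into before it is tucked into a `Dynamic`.

```haskell
import Data.Dynamic (Dynamic, toDyn, fromDynamic)
import Text.Read (readMaybe)

-- Simplified: real ARFF also has integer, date, nominal, ...
data DataType = NumericT | StringT

fieldVal :: DataType -> String -> Maybe Dynamic
fieldVal _        "?" = Nothing                            -- missing value
fieldVal NumericT s   = toDyn <$> (readMaybe s :: Maybe Double)
fieldVal StringT  s   = Just (toDyn s)
```

Note that in this sketch a numeric field that fails to parse also surfaces as `Nothing`; a real parser would more likely report an error there.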
An alternative would have been to parse the data fields in each record simply as text, and then provide two "conversion" classes (such as `FromArffRecord` and `ToArffRecord`) so that users of these records can make their own data types instances of the classes. That would possibly have been easier than the way I chose.
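One possible shape for that class-based alternative (the class and `Point` record here are hypothetical examples, not code from the repository): fields stay as raw `ByteString`s and each user record defines how it is built from them.

```haskell
import qualified Data.ByteString.Char8 as B
import Text.Read (readMaybe)

-- Users convert a row of raw text fields into their own record type.
class FromArffRecord a where
  fromArffRecord :: [B.ByteString] -> Maybe a

-- Example user record:
data Point = Point { px :: Double, py :: Double }
  deriving (Eq, Show)

instance FromArffRecord Point where
  fromArffRecord [x, y] = Point <$> r x <*> r y
    where r = readMaybe . B.unpack
  fromArffRecord _      = Nothing
```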
I chose the `Dynamic` way so that `readDataset` / `safeReadDataset` could later extract the list directly into a vector. Also, using `Dynamic` builds type-checking into the parser, since we already know the "expected" types of the fields in a record (as given in the `@attribute` section). The `FromArffRecord` / `ToArffRecord` method would not use the information available in the `@attribute` section.
Do you have other ideas here?
@arvindd thank you for your explanation :) Indeed, as you say, `Dynamic` or something similar is inescapable. An alternative would be a sum type over the possible field types (e.g. `data Ty = TyInt Int | TyDouble Double | TyText Text` etc.), but this approach would still need parsers to be supplied (as I'm doing here, basically: https://github.com/ocramz/heidi/blob/master/src/Data/Generics/Encode/Internal.hs#L122 ).
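Sketching that alternative out (`Ty` as in the comment above; `parseTy` is a hypothetical helper): a closed universe of field types instead of `Dynamic`, so consumers can pattern-match exhaustively, though a parser per constructor still has to live somewhere.

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import Text.Read (readMaybe)

data Ty = TyInt Int | TyDouble Double | TyText Text
  deriving (Eq, Show)

-- The parsers that still need supplying: try Int, then Double,
-- and fall back to text.
parseTy :: Text -> Ty
parseTy t
  | Just i <- readMaybe s = TyInt i
  | Just d <- readMaybe s = TyDouble d
  | otherwise             = TyText t
  where s = T.unpack t
```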
You are essentially building yet another take on Haskell dataframes :)
The thing is that `datasets` instead assumes fixed record types, so I guess I should just rename this ticket, and your work could go into `dh-core` proper.
@ocramz your assumption that `datasets` has fixed record types still holds. Users of the ArffParser will in any case model a known ARFF file with this parser and extract the fields into their own specific record types; so, essentially, this still qualifies as a fixed-type dataset.
Your idea of using sum types for this problem is very nice. The immediate benefit I see is that we could parse even non-primitive types directly from the records, since the parsers are supplied by the user. With `Dynamic`, we anyway need another layer of "parsers" that turn the `Dynamic`s into the record types we need, as I mention above. That is essentially what is currently done with the CSV parser too: the embedded iris (and similar) datasets are CSV files whose fields arrive as `ByteString`s, and iris has a `fromField` instance on the record to get the right field type.
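For comparison, the CSV path just mentioned looks roughly like this with cassava's `FromField` class (a real class; the `IrisClass` type here is a hypothetical example, not the actual Iris.hs code): each raw `ByteString` field is parsed by the instance that the record's field type supplies.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (mzero)
import Data.Csv (FromField (..), runParser)

data IrisClass = Setosa | Versicolor | Virginica
  deriving (Eq, Show)

-- cassava hands each field to parseField as a ByteString;
-- unknown class labels fail the parse via mzero.
instance FromField IrisClass where
  parseField "Iris-setosa"     = pure Setosa
  parseField "Iris-versicolor" = pure Versicolor
  parseField "Iris-virginica"  = pure Virginica
  parseField _                 = mzero
```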
Using `Dynamic` does not solve this problem; we still need that extra layer of parsers to extract values from `Dynamic` into the right record fields (as is currently done in Iris.hs). What using `Dynamic` does buy us is that we not only type-check within the parser (because we know the field types from the `@attribute` section), but also tuck each value into a `Dynamic` element (instead of a `ByteString`), so that it can simply be extracted using `fromDyn a :: <type>` in a dataset model file such as Iris.hs.
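That extraction layer can be sketched as follows (an Iris-like record reduced to two fields for brevity; `fromRow` is a hypothetical helper, not the actual Iris.hs code): values tucked into `Dynamic` are pulled back out at their expected types with `fromDynamic`.

```haskell
import Data.Dynamic (Dynamic, toDyn, fromDynamic)

data Iris = Iris { sepalLength :: Double, irisClass :: String }
  deriving (Eq, Show)

-- fromDynamic returns Nothing on a type mismatch, so a wrongly
-- typed or missing field makes the whole row come out as Nothing.
fromRow :: [Maybe Dynamic] -> Maybe Iris
fromRow [Just len, Just cls] = Iris <$> fromDynamic len <*> fromDynamic cls
fromRow _                    = Nothing
```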
The current implementation of ArffParser also does not encode sample classes into any particular form; it simply passes the class through as a `ByteString`, but does check that the class is valid using the information from `@attribute`. So, for example, an Iris-setosa will just be parsed as a string; we only check that such a class exists (using the `@attribute` information). The best way to encode this information too would be dependent types :-) or singletons, which are currently the only way in Haskell for such problems. I just did not want to make this more complicated than it already is :-)
Hello @ocramz, I have now committed the final code for this, including an example dataset (Diabetes.hs) that uses the new ARFF parser. Kindly review it and suggest improvements / give feedback.
Also, please suggest anything I need to do before I can raise a pull request for it.
Here are the main files that were added:
There are modifications in these files:
Hi @arvindd, thank you for this! I also really appreciate that you provided a usage example, which is clean and easy to follow (and extend, too!).
It would be most convenient for me (and/or any other reviewers) if we could discuss the changes to the core modules over a pull request.
Thanks @ocramz, I have raised PR #65 for this issue. I requested reviews from both yourself and @stites, based on the git blame of the changed files.
All checks passed after the build, and I get a message that I could merge even without a review being done; I am also told the merge can be done automatically, as there are no merge conflicts.
Since I am not sure what the process at DataHaskell is regarding PRs, I have not merged yet (this is my first PR here :-)).
Kindly let me know whether I should merge the PR now or wait for your reviews.
@arvindd Thank you for your work, this can be closed now!