Comments (10)
Leveraging parsec for the format should make this pretty simple. @ocramz you also may want to monospace those `@` signs : )
from dh-core.
Hello @ocramz, I have almost done this so far:
https://github.com/arvindd/dh-core/blob/arvindd-arff-38/datasets/src/Numeric/Datasets/ArffParser.hs
This file is under the datasets folder:
https://github.com/arvindd/dh-core/tree/arvindd-arff-38
Basically, I added a module `Numeric.Datasets.ArffParser` in the datasets package, and also added some ARFF files in the datafiles folder. The module exposes one function, `parseArff`, which returns a tuple (relation name, attributes, ARFF records). The ARFF records are returned as `[Maybe Dynamic]`, with `Nothing` for missing values (i.e., where the original ARFF contains a `?`) and `Just <dynamic value>` for present values. I hope you had the same idea :-)
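To illustrate the missing-value convention just described, here is a minimal sketch (the `fieldVal` helper here is hypothetical, not the actual module code): a `?` token maps to `Nothing`, anything else is wrapped in a `Dynamic`.

```haskell
import Data.Dynamic (Dynamic, toDyn, fromDynamic)

-- Hypothetical stand-in for the parser's per-field handling:
-- ARFF marks missing values with '?', which becomes Nothing.
fieldVal :: String -> Maybe Dynamic
fieldVal "?" = Nothing
fieldVal s   = Just (toDyn s)

main :: IO ()
main = do
  print (fieldVal "?")                                        -- Nothing
  print (fieldVal "Iris-setosa" >>= fromDynamic :: Maybe String)
```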
As a proof of usage of this parser, I plan to add more datasets utilising the ARFF files that I have also added in the datafiles folder. I will need to expose more types from the ArffParser for this; for example, I will have to expose `Datatype`, `Attribute`, etc.
While I am working on this, can you (or anybody else) do a cursory review of the code above and give me any feedback that I can incorporate in the next commits?
PS: Please assign this feature / issue to me, and I shall send a pull-request after I finish.
Thank you @arvindd ! The first thing that stuck out for me is your use of `Data.Dynamic`; what's the motivation there?
The reason I chose `Dynamic` is that our parser cannot know the types of the fields in a record beforehand; it only learns them after parsing the `@attribute` section of the ARFF file. Once the type is known from the `@attribute` section, the function `fieldval` can parse the field at the right type and put it into a `Dynamic` element, so that we end up with a list of `Dynamic`s (in our case, `Maybe Dynamic`, to take care of missing values).
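A minimal sketch of that idea, with simplified hypothetical names (the real module's attribute type covers more ARFF types than these two): the type learned from the `@attribute` section decides what `fieldVal` parses the raw token into before it is tucked into a `Dynamic`.

```haskell
import Data.Dynamic (Dynamic, toDyn, fromDynamic)
import Text.Read (readMaybe)

-- Simplified: real ARFF also has integer, date, nominal, ...
data DataType = NumericT | StringT

fieldVal :: DataType -> String -> Maybe Dynamic
fieldVal _        "?" = Nothing                            -- missing value
fieldVal NumericT s   = toDyn <$> (readMaybe s :: Maybe Double)
fieldVal StringT  s   = Just (toDyn s)
```

Note that in this sketch a numeric field that fails to parse also surfaces as `Nothing`; a real parser would more likely report an error there.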
An alternative would have been to parse the data fields in each record simply as text, and then provide two "conversion" classes (such as `FromArffRecord` and `ToArffRecord`) so that users of these records can make their own data types instances of the classes. That would possibly have been easier than the way I chose.
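One possible shape for that class-based alternative (the class and `Point` record here are hypothetical examples, not code from the repository): fields stay as raw `ByteString`s and each user record defines how it is built from them.

```haskell
import qualified Data.ByteString.Char8 as B
import Text.Read (readMaybe)

-- Users convert a row of raw text fields into their own record type.
class FromArffRecord a where
  fromArffRecord :: [B.ByteString] -> Maybe a

-- Example user record:
data Point = Point { px :: Double, py :: Double }
  deriving (Eq, Show)

instance FromArffRecord Point where
  fromArffRecord [x, y] = Point <$> r x <*> r y
    where r = readMaybe . B.unpack
  fromArffRecord _      = Nothing
```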
I chose the `Dynamic` way so that `readDataset` / `safeReadDataset` could later extract the list directly into a vector. Also, using `Dynamic` builds type-checking into the parser, since we already know the "expected" types of the fields in a record (as given in the `@attribute` section). The `FromArffRecord` / `ToArffRecord` method would not use the information available in the `@attribute` section.
Do you have other ideas here?
@arvindd thank you for your explanation :) Indeed, as you say, `Dynamic` or something similar is inescapable. An alternative would be a sum type over the possible field types (e.g. `data Ty = TyInt Int | TyDouble Double | TyText Text` etc.), but this approach would still need parsers to be supplied (as I'm doing here, basically: https://github.com/ocramz/heidi/blob/master/src/Data/Generics/Encode/Internal.hs#L122 ).
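Sketching that alternative out (`Ty` as in the comment above; `parseTy` is a hypothetical helper): a closed universe of field types instead of `Dynamic`, so consumers can pattern-match exhaustively, though a parser per constructor still has to live somewhere.

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import Text.Read (readMaybe)

data Ty = TyInt Int | TyDouble Double | TyText Text
  deriving (Eq, Show)

-- The parsers that still need supplying: try Int, then Double,
-- and fall back to text.
parseTy :: Text -> Ty
parseTy t
  | Just i <- readMaybe s = TyInt i
  | Just d <- readMaybe s = TyDouble d
  | otherwise             = TyText t
  where s = T.unpack t
```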
You are essentially building yet another take on Haskell dataframes :)
The thing is that `datasets` instead assumes fixed record types, so I guess I should just rename this ticket, and your work could go into `dh-core` proper.
@ocramz your assumption that `datasets` has fixed record types still holds. Users of the ArffParser will in any case model a known ARFF file with this parser and extract the fields into their own specific record types; so, essentially, this still qualifies as a fixed-type dataset.
Your idea of using sum types for this problem is very nice. The immediate benefit I see is that we could parse even non-primitive types directly from the records, since the parsers are supplied by the user. With `Dynamic`, we anyway need another layer of "parsers" that turn the `Dynamic`s into the record types we need, as I mention above. That is essentially what is currently done with the CSV parser too: the embedded iris (and similar) datasets are CSV files whose fields arrive as `ByteString`s, and iris has a `fromField` instance on the record to get the right field type.
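For comparison, the CSV path just mentioned looks roughly like this with cassava's `FromField` class (a real class; the `IrisClass` type here is a hypothetical example, not the actual Iris.hs code): each raw `ByteString` field is parsed by the instance that the record's field type supplies.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (mzero)
import Data.Csv (FromField (..), runParser)

data IrisClass = Setosa | Versicolor | Virginica
  deriving (Eq, Show)

-- cassava hands each field to parseField as a ByteString;
-- unknown class labels fail the parse via mzero.
instance FromField IrisClass where
  parseField "Iris-setosa"     = pure Setosa
  parseField "Iris-versicolor" = pure Versicolor
  parseField "Iris-virginica"  = pure Virginica
  parseField _                 = mzero
```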
Using `Dynamic` does not solve this problem; we still need that extra layer of parsers to extract values from `Dynamic` into the right record fields (as is currently done in Iris.hs). What using `Dynamic` does buy us is that we not only type-check within the parser (because we know the field types from the `@attribute` section), but also tuck each value into a `Dynamic` element (instead of a `ByteString`), so that it can simply be extracted using `fromDyn a :: <type>` in a dataset model file such as Iris.hs.
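That extraction layer can be sketched as follows (an Iris-like record reduced to two fields for brevity; `fromRow` is a hypothetical helper, not the actual Iris.hs code): values tucked into `Dynamic` are pulled back out at their expected types with `fromDynamic`.

```haskell
import Data.Dynamic (Dynamic, toDyn, fromDynamic)

data Iris = Iris { sepalLength :: Double, irisClass :: String }
  deriving (Eq, Show)

-- fromDynamic returns Nothing on a type mismatch, so a wrongly
-- typed or missing field makes the whole row come out as Nothing.
fromRow :: [Maybe Dynamic] -> Maybe Iris
fromRow [Just len, Just cls] = Iris <$> fromDynamic len <*> fromDynamic cls
fromRow _                    = Nothing
```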
The current implementation of ArffParser also does not encode sample classes into any particular form; it simply passes the class through as a `ByteString`, but does check that the class is valid using the information from `@attribute`. So, for example, an Iris-setosa will just be parsed as a string; we only check that such a class exists (using the `@attribute` information). The best way to encode this information too would be dependent types :-) or singletons, which are currently the only way in Haskell for such problems. I just did not want to make this more complicated than it already is :-)
Hello @ocramz, I have now committed the final code for this, including an example dataset (Diabetes.hs) that uses the new ARFF parser. Kindly review it and suggest improvements / give feedback.
Also, please suggest anything I need to do before I can raise a pull request for it.
Here are the main files that were added:
There are modifications in these files:
Hi @arvindd, thank you for this! I also really appreciate that you provided a usage example, which is clean and easy to follow (and extend, too!).
It would be most convenient for me (and/or any other reviewers) if we could discuss the changes to the core modules over a pull request.
Thanks @ocramz, I have raised PR #65 for this issue. I requested reviews from both yourself and @stites, based on the git blame of the changed files.
All checks passed after the build, and I get a message that I could merge even without a review being done; I am also told the merge can be done automatically, as there are no merge conflicts.
Since I am not sure what the process at DataHaskell is regarding PRs, I have not merged yet (this is my first PR here :-)).
Kindly let me know whether I should merge the PR now or wait for your reviews.
@arvindd Thank you for your work, this can be closed now!