datahaskell / dh-core: Functional data science
In #14 I raised the point that we need a set of type classes for properties of numbers, and another for approximate comparison.
So what I think we need is a type class for things like machine epsilon, the maximal and minimal representable numbers, transfinite numbers, NaN handling, etc., and another type class for approximate comparison of values. The design space here is rather large, so it would be good to collect the current state of the art and implementations in different languages and libraries.
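As a concrete starting point for discussion, here is a minimal sketch of what the two type classes could look like; all names (`FloatingProps`, `ApproxEq`, and their methods) are placeholders for the sake of illustration, not a proposed final API:

```haskell
-- Properties of a floating-point representation.
class RealFloat a => FloatingProps a where
  machineEpsilon   :: a   -- smallest e such that 1 + e /= 1
  maxRepresentable :: a   -- largest finite value
  minPositive      :: a   -- smallest positive normalized value

instance FloatingProps Double where
  machineEpsilon   = 2.220446049250313e-16
  maxRepresentable = 1.7976931348623157e308
  minPositive      = 2.2250738585072014e-308

-- Approximate comparison, parameterized by a relative tolerance.
class ApproxEq a where
  approxEq :: a -> a -> a -> Bool   -- approxEq tol x y

instance ApproxEq Double where
  approxEq tol x y = abs (x - y) <= tol * max (abs x) (abs y)
```

A design question this immediately raises is whether the tolerance should be an explicit argument (as here) or baked into a newtype wrapper.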
The set of transitive dependencies of analyze is currently quite large:
base-compat-0.9.3: build
base-orphans-0.6: build
dlist-0.8.0.3: build
cabal-doctest-1.0.2: build
integer-logarithms-1.0.2: build
mtl-2.2.1: build
primitive-0.6.2.0: build
random-1.1: build
semigroups-0.18.3: build
stm-2.4.4.1: build
text-1.2.2.2: build
time-locale-compat-0.1.1.3: build
StateVar-1.1.0.4: build
transformers-compat-0.5.1.4: build
vector-0.12.0.1: build
void-0.7.2: build
exceptions-0.8.3: build
contravariant-1.4: build
mmorph-1.0.9: build
tagged-0.8.5: build
distributive-0.5.2: build
comonad-5.0.1: build
bifunctors-5.4.2: build
profunctors-5.2: build
semigroupoids-5.2: build
free-4.12.4: build
blaze-builder-0.4.0.2: build
hashable-1.2.6.1: build
scientific-0.3.5.1: build
unordered-containers-0.2.8.0: build
attoparsec-0.13.1.0: build
uuid-types-1.0.3: build
lucid-2.9.8.1: build
vector-th-unbox-0.2.1.6: build
math-functions-0.2.1.0: build
mwc-random-0.13.6.0: build
cassava-0.4.5.1: build
aeson-1.1.2.0: build
foldl-1.2.5: build
For example, I would like to understand whether free (which brings in a few dependencies) is really necessary, or whether it can be removed in favour of a simpler (if ad-hoc) solution.
When running stack bench I get:
bench: /Users/ocramz/.cache/datasets-hs/cifar-10-imagefolder/Truck: getDirectoryContents:openDirStream: does not exist (No such file or directory)
I guess it's a matter of copying the test data into a temporary directory before running these tests.
As separate sub-project
Users who don't use stack
might have a hard time building this project, so (lower) version bounds should be added to all contributed packages and to dh-core itself.
Possibly a binary in the app/ folder with an end-to-end workflow. Then we can split anything good that comes out of this back into the main library.
I'm looking to write a data loader which gets image datasets from disk in the same way PyTorch's DataLoader class does. It should have the option to load images in batches. Originally this was going to go into hasktorch, but I think it might be better served in datasets -- what do you think? This could be done in isolation from most of datasets, but one small wrinkle is that the code could be written to also fetch datasets like CIFAR-10 or MNIST (which I would assume is ideal) -- in that case there might be some overlap with getFileFromSource, and some refactoring might be nice (like multithreaded downloads).
Does this sound like a good contribution?
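For concreteness, here is a minimal sketch of the directory-walking and batching part, assuming a PyTorch-style ImageFolder layout (root/&lt;class-name&gt;/&lt;image files&gt;); `loadImageFolder`, `Batch`, and `batchesOf` are made-up names, and actual image decoding is left out:

```haskell
import System.Directory (listDirectory)
import System.FilePath ((</>))
import Data.List (sort)

-- Stand-in for a batch of decoded images; a real loader would
-- decode these paths into tensors.
type Batch = [FilePath]

-- Group a list into batches of size n (assumes n >= 1).
batchesOf :: Int -> [a] -> [[a]]
batchesOf _ [] = []
batchesOf n xs = let (b, rest) = splitAt n xs in b : batchesOf n rest

-- Enumerate (class, batch-of-file-paths) pairs from an
-- ImageFolder-style layout: root/<class-name>/<image files>.
loadImageFolder :: FilePath -> Int -> IO [(String, Batch)]
loadImageFolder root batchSize = do
  classes <- sort <$> listDirectory root
  fmap concat . mapM perClass $ classes
  where
    perClass cls = do
      files <- sort <$> listDirectory (root </> cls)
      pure [ (cls, map ((root </> cls) </>) b)
           | b <- batchesOf batchSize files ]
```

Shuffling and lazy/streamed loading would be the natural next steps, and are where the overlap with getFileFromSource would show up.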
The Netflix dataset seems to still be available in the public domain via Kaggle:
https://www.kaggle.com/netflix-inc/netflix-prize-data
contrary to the comment in the corresponding data loader:
The Netflix Prize dataset uses a custom parser because one data example does not fit into a single dataset row (such as a CSV row) but has a custom "stanza-based" format. For example, these are two stanzas of the "qualifying.txt" data file:
1:
1046323,2005-12-19
1080030,2005-12-23
2127527,2005-12-04
1944918,2005-10-05
1057066,2005-11-07
954049,2005-12-20
10:
12868,2004-10-19
627923,2005-12-16
690763,2005-12-13
It would be nice to upgrade the library so that it can deal with such cases.
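As a sketch of what such a stanza parser could look like (the names `MovieId`, `Entry`, and `parseStanzas` are illustrative, not the library's actual types, and the parser assumes well-formed "customerId,date" lines):

```haskell
import Data.List (isSuffixOf)

type MovieId = Int
data Entry = Entry { customerId :: Int, rentalDate :: String }
  deriving Show

-- Parse the "qualifying.txt"-style layout: a "<movieId>:" header
-- line followed by "customerId,date" lines until the next header.
parseStanzas :: String -> [(MovieId, [Entry])]
parseStanzas = go . lines
  where
    go [] = []
    go (l : ls)
      | ":" `isSuffixOf` l =
          let movieId      = read (init l)
              (body, rest) = break (":" `isSuffixOf`) ls
          in (movieId, map parseEntry body) : go rest
      | otherwise = go ls   -- skip lines before the first header
    -- Assumes each body line contains exactly one comma.
    parseEntry l =
      let (cust, _ : date) = break (== ',') l
      in Entry (read cust) date
```

A production version would of course report malformed lines instead of skipping or crashing on them, which ties in with the MonadThrow discussion elsewhere in this tracker.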
The datasets downloader could use the above improvements: verifying downloads with hashes, and multithreading large downloads. I've written a version of the first feature in the Setup.hs for a personal moby/dictd replacement, so it might look something like the below:
https://gist.github.com/stites/82acb2036d1654b0ef0c34ec4443579b
Reconstruction of the full original MNIST image set.
Currently we build against Stackage LTS 11.22, but some dependencies (e.g. req) changed in a non-backward-compatible way.
Fix: upgrade to Stackage nightly for now, until the next LTS comes out.
Describe the bug
The UCI ML Repository link http://mlr.cs.umass.edu/ml/datasets/housing is down, and requesting the BostonHousing dataset throws an exception:
*** Exception: VanillaHttpException (HttpExceptionRequest Request {
host = "mlr.cs.umass.edu"
port = 80
secure = False
requestHeaders = []
path = "/ml/machine-learning-databases/housing/housing.data"
queryString = ""
method = "GET"
proxy = Nothing
rawBody = False
redirectCount = 10
responseTimeout = ResponseTimeoutDefault
requestVersion = HTTP/1.1
proxySecureMode = ProxySecureWithConnect
}
(ConnectionFailure Network.Socket.getAddrInfo (called with preferred socket type/protocol: AddrInfo {addrFlags = [], addrFamily = AF_UNSPEC, addrSocketType = Stream, addrProtocol = 0, addrAddress = 0.0.0.0:0, addrCanonName = Nothing}, host name: Just "mlr.cs.umass.edu", service name: Just "80"): does not exist (nodename nor servname provided, or not known)))
To Reproduce
Steps to reproduce the behavior:
import Numeric.Datasets (getDataset)
import Numeric.Datasets.BostonHousing (bostonHousing)
bh <- getDataset bostonHousing
Expected behavior
Loads the Boston Housing dataset into memory as the object bh.
Additional context
The line below needs to be updated to use uciMLDB.
Reference: some dataset URLs were corrected in #67
cassava / sv
hnetcdf (https://github.com/ian-ross/hnetcdf)
hedis
esqueleto
beam
Code coverage should be added to the Travis config (perhaps the cabal file and/or the stack options need to be changed in order to account for hpc coverage generation); currently the Travis config only contains a project key.
This tool uploads hpc coverage reports to codecov.io.
Looking over the Dataloader code, I immediately thought about integrating a private dataset to play with some Haskell code. This made me wonder if anyone has thought about adding a cross-validation layer on top of it. For some canonical datasets there are predefined splits (test, train, validation), but for others one would need to define these.
It would be nice if there were some code that could partition some given data according to k-folds and leave-p-out. In the case of time-series datasets, you'd have to make sure that the partitions respect the temporal ordering.
Currently, the parsers error and fail here and there. Since these are synchronous exceptions, it would be better to use MonadThrow, which can conveniently be used at a "pure" type such as Maybe or Either. Concretely:
- add exceptions as a dependency
- turn calls to error and fail into calls to throwM
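A sketch of the proposed change, assuming the exceptions package; `ParseError` and `parseDouble` are illustrative names, not existing code:

```haskell
import Control.Monad.Catch (MonadThrow, throwM, Exception)
import Text.Read (readMaybe)

newtype ParseError = ParseError String
  deriving Show
instance Exception ParseError

-- Instead of calling `error`/`fail` on bad input, throw via
-- MonadThrow, so the caller picks the result type: Maybe,
-- Either SomeException, IO, ...
parseDouble :: MonadThrow m => String -> m Double
parseDouble s = case readMaybe s of
  Just x  -> pure x
  Nothing -> throwM (ParseError ("not a Double: " ++ s))
```

At `m ~ Maybe` the throw becomes Nothing, and at `m ~ IO` it becomes a regular synchronous exception, so existing IO callers keep working unchanged.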
I've started adding some code from my decision-trees project under the Core.Numeric.Statistics and Core.Data namespaces. There is some machinery that could be re-used (for example the Dataset abstraction for labeled data and some information-theory functionals).
See 6bba752
Some references:
It's a way of emulating a not-quite-quad-precision number using two doubles. The algorithm is interesting by itself and could have a few uses, but I think its main value is providing an example of a constant-size approximation of real numbers which isn't IEEE 754. It would be very useful for implementing type classes for working with low-level representations of numbers; without such examples it's all too easy to assume that only single- and double-precision IEEE 754 numbers exist.
A Julia implementation and references can be found here: https://github.com/JuliaMath/DoubleDouble.jl
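To make the idea concrete, here is the core building block, Knuth's exact two-sum, plus a minimal double-double type; `DD` and `addDD` are illustrative names, and a full implementation would need renormalization and the remaining arithmetic operations:

```haskell
-- A double-double: an unevaluated sum hi + lo with |lo| much
-- smaller than |hi|, giving roughly 2x the precision of Double.
data DD = DD { ddHi :: !Double, ddLo :: !Double }
  deriving Show

-- Knuth's error-free transformation: s + e == a + b exactly,
-- where s = fl(a + b) and e is the rounding error.
twoSum :: Double -> Double -> (Double, Double)
twoSum a b = (s, e)
  where
    s  = a + b
    b' = s - a
    a' = s - b'
    e  = (a - a') + (b - b')

-- Add a Double to a double-double, accumulating the error term.
addDD :: DD -> Double -> DD
addDD (DD h l) x =
  let (s, e) = twoSum h x
  in DD s (l + e)
```

The point for the type-class discussion is that DD has no machine epsilon in the IEEE 754 sense: its effective precision depends on the magnitude of the value, which is exactly the kind of case a representation-properties class has to accommodate.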
[14 of 14] Compiling Main ( test/Spec.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/spec/spec-tmp/Main.o )
dh-core/analyze/test/Spec.hs:30:14: error:
• Ambiguous type variable ‘e0’ arising from a use of ‘catch’
prevents the constraint ‘(Exception e0)’ from being solved.
Probable fix: use a type annotation to specify what ‘e0’ should be.
These potential instances exist:
instance Exception SomeException -- Defined in ‘GHC.Exception’
instance Exception A.ColSizeMismatch
-- Defined at src/Analyze/Common.hs:37:10
instance (Show k,
base-4.11.1.0:Data.Typeable.Internal.Typeable k) =>
Exception (A.DuplicateKeyError k)
-- Defined at src/Analyze/Common.hs:33:10
...plus five others
...plus 17 instances involving out-of-scope types
(use -fprint-potential-instances to see them all)
• In the expression: catch (action >> return P.succeeded) handler
In an equation for ‘tester’:
tester = catch (action >> return P.succeeded) handler
In an equation for ‘propertyIO’:
propertyIO action
= ioProperty tester
where
tester :: IO P.Result
tester = catch (action >> return P.succeeded) handler
handler (HUnitFailure err) = return P.failed {P.reason = err}
|
30 | tester = catch (action >> return P.succeeded) handler
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dh-core/analyze/test/Spec.hs:31:14: error:
• The constructor ‘HUnitFailure’ should have 2 arguments, but has been given 1
• In the pattern: HUnitFailure err
In an equation for ‘handler’:
handler (HUnitFailure err) = return P.failed {P.reason = err}
In an equation for ‘propertyIO’:
propertyIO action
= ioProperty tester
where
tester :: IO P.Result
tester = catch (action >> return P.succeeded) handler
handler (HUnitFailure err) = return P.failed {P.reason = err}
|
31 | handler (HUnitFailure err) = return P.failed { P.reason = err }
| ^^^^^^^^^^^^^^^^
The latest req doesn't use it (https://hackage.haskell.org/package/req), so we don't need it either.
datasets is very inconsistent, with lots of extra whitespace which causes terrible diffs. I think it, as well as dh-core, needs linting rules for consistency when developing -- also possibly GitHub hooks to reject PRs that don't adhere.
brittany is the current standard for haskell-ide-engine, is very flexible, and has a style I'm familiar with -- so that would be my vote. If anyone has alternatives, I think they should mention them here.
Linting the codebase basically requires everyone to sync up on branches. Luckily there are only four forks; we should try to sync up here to minimize the number of rebases that will be required after a linting commit.
datasets -> wreq -> lens-aeson -> lens
The RFrame type currently stores the frame entries as a Vector of Vectors (each inner vector being a data row). It would be nice to compare the performance of this way of storing with that of a streaming library (e.g. Stream (Of (Vector v)) m ()).
The current problem is as follows: (U.sum . flip M.column 0) a does not fuse. It seems to boil down to:
testRewrite1 :: Matrix -> Double -- fuses
testRewrite1 (Matrix r c v) = U.sum . flip (\u j -> U.generate r (\i -> u `U.unsafeIndex` (j + i * c))) 0 $ v

testRewrite2 :: Matrix -> Double -- does NOT fuse
testRewrite2 m = U.sum . flip (\(Matrix r c v) j -> U.generate r (\i -> v `U.unsafeIndex` (j + i * c))) 0 $ m
Note: the flip isn't important, it's just there for convenience, since this is from https://github.com/Magalame/fastest-matrices
So what seems to happen is that stream fusion cannot "go through" the Matrix constructor; I'm not sure exactly why.
As sub-project
Better to move parts of the backend and the typeclass interface for now
Some unit tests asserting e.g. the length or some other property of the datasets would be nice to have.
Is your feature request related to a problem? Please describe.
We cannot have criterion benchmarks for dense-linear-algebra, since there is this dependency chain:
criterion -> statistics -> dense-linear-algebra
chronos-bench (https://hackage.haskell.org/package/chronos-bench) doesn't depend on dense-linear-algebra ^_^
Describe the solution you'd like
Have some performance benchmarks
Describe alternatives you've considered
There might be alternative benchmarking packages not based on dense-linear-algebra
dh-core as a whole with latest changes

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes.
https://www.cs.waikato.ac.nz/ml/weka/arff.html
Overview
ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.
The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types. An example header on the standard IRIS dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%[email protected])
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are case-insensitive.
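A sketch of how the @ATTRIBUTE declarations shown above could be parsed; `Attribute` and `parseAttribute` are illustrative names, and this only covers the NUMERIC and nominal cases from the example (a full parser would also handle STRING, DATE, quoted names, etc.):

```haskell
import Data.Char (toLower)

data Attribute
  = Numeric String            -- @ATTRIBUTE name NUMERIC
  | Nominal String [String]   -- @ATTRIBUTE name {a,b,c}
  deriving (Show, Eq)

parseAttribute :: String -> Maybe Attribute
parseAttribute line = case words line of
  (kw : name : rest)
    | map toLower kw == "@attribute" -> parseType name (unwords rest)
  _ -> Nothing
  where
    -- The keyword and type are case-insensitive per the ARFF spec.
    parseType name t
      | map toLower t == "numeric" = Just (Numeric name)
      | ('{' : s) <- t =
          Just (Nominal name (splitOn ',' (takeWhile (/= '}') s)))
      | otherwise = Nothing
    splitOn c s = case break (== c) s of
      (a, [])       -> [a]
      (a, _ : rest) -> a : splitOn c rest
```

Since the @DATA section is plain comma-separated values, the existing cassava machinery could probably be reused for it once the header has been consumed.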
I guess all packages should build with one consistent library set. Correct me if I am wrong @ocramz .
The project fails to build on macOS 10.14.3 (Mojave). The compiler complains that it cannot find headers and binaries for the zlib and curl libraries.
To Reproduce
Steps to reproduce the behavior:
git clone [email protected]:DataHaskell/dh-core.git
cd dh-core
git checkout a2ad2552e8525acf0ace12069d29f333d1793f05
cd dh-core
stack build --no-nix
stack clean, probably because the built library is cached somewhere. Will try to reproduce the error with a Travis build.

Workarounds
brew install curl zlib
stack build \
--extra-include-dirs=/usr/local/opt/curl/include --extra-lib-dirs=/usr/local/opt/curl/lib \
--extra-include-dirs=/usr/local/opt/zlib/include --extra-lib-dirs=/usr/local/opt/zlib/lib
Alternatively, add to dh-core/stack.yaml:
nix:
  enable: true
  packages:
    - curl
    - zlib
and run stack build
Environment
The generateSym function is defined as:
generateSym :: Int -> (Int -> Int -> Double) -> Matrix
generateSym n f = runST $ do
m <- unsafeNew n n
for 0 n $ \r -> do
unsafeWrite m r r (f r r)
for (r+1) n $ \c -> do
let x = f r c
unsafeWrite m r c x
unsafeWrite m c r x
unsafeFreeze m
Running it with n=100, we can note that the function allocates ~160,000 bytes of memory, which is around twice what we would expect when allocating one Matrix.
This allocation seems to be related to the dependence of x on c. If we change f r c to f r r, the allocation drops by 80,000 bytes, and the runtime is halved.
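For reference, the per-call allocation figure can be measured from within Haskell using base's thread-local allocation counter (it counts down as the thread allocates, with per-block granularity, so the result is approximate); `allocatedBy` is a made-up helper name:

```haskell
import GHC.Conc (getAllocationCounter, setAllocationCounter)
import Control.Exception (evaluate)
import Data.Int (Int64)

-- Run an action and return roughly how many bytes it allocated
-- on the current thread.
allocatedBy :: IO a -> IO Int64
allocatedBy act = do
  setAllocationCounter maxBound  -- reset the down-counting counter
  _ <- act
  counter <- getAllocationCounter
  pure (maxBound - counter)
```

Something like `allocatedBy (evaluate (generateSym 100 f))` should then report a number in the vicinity of the ~160 kB quoted above, making it easy to check candidate fixes.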
Test fixtures (e.g. analyze/test/Fixtures.hs) could be gradually replaced by test properties, based on quickcheck, hedgehog or genvalidity.
As sub-projects since they are already published
As sub-project
SIMD instructions seem to be of great importance for the performance of a linear algebra library. The big question then is how to incorporate them into the rest of the library.
I've had some success with a fork of simd
: https://github.com/Magalame/simd
Another question to solve would be how to:
I'm working on a hasktorch example with fashion-mnist, and @stites suggested adding the dataset to datasets, which I think is pretty useful!
Referring to this issue over at hasktorch: hasktorch/hasktorch#102