
dh-core's People

Contributors

adlucem, arvindd, bos, kaizhang, lehins, lunaticare, magalame, mjarosie, mmesch, nandaleite, ocramz, raduom, shimuuar, stites, unkdeve


dh-core's Issues

Floating point and approximate comparison

In #14 I raised the point that we need a set of type classes for properties of numbers and for approximate comparison.

So what I think we need is a type class for things like machine epsilon, the largest and smallest representable numbers, transfinite values (±∞), NaN handling, etc., and another type class for approximate comparison of values. The design space here is rather large, so it would be good to collect the current state of the art and implementations in different languages and libraries.
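A minimal sketch of the two classes, just to make the idea concrete (the names, and the constants in the Double instance, are illustrative only, not a proposed design):

class RealFloat a => NumRepr a where
  machineEpsilon :: a
  largestFinite  :: a
  smallestNormal :: a

class ApproxEq a where
  -- comparison up to an absolute tolerance
  approxEqAbs :: a -> a -> a -> Bool

instance NumRepr Double where
  machineEpsilon = 2.220446049250313e-16
  largestFinite  = 1.7976931348623157e308
  smallestNormal = 2.2250738585072014e-308

instance ApproxEq Double where
  approxEqAbs tol x y = abs (x - y) <= tol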

analyze: reduce transitive dependencies

The set of transitive dependencies of analyze is currently quite large:

base-compat-0.9.3: build
base-orphans-0.6: build
dlist-0.8.0.3: build
cabal-doctest-1.0.2: build
integer-logarithms-1.0.2: build
mtl-2.2.1: build
primitive-0.6.2.0: build
random-1.1: build
semigroups-0.18.3: build
stm-2.4.4.1: build
text-1.2.2.2: build
time-locale-compat-0.1.1.3: build
StateVar-1.1.0.4: build
transformers-compat-0.5.1.4: build
vector-0.12.0.1: build
void-0.7.2: build
exceptions-0.8.3: build
contravariant-1.4: build
mmorph-1.0.9: build
tagged-0.8.5: build
distributive-0.5.2: build
comonad-5.0.1: build
bifunctors-5.4.2: build
profunctors-5.2: build
semigroupoids-5.2: build
free-4.12.4: build
blaze-builder-0.4.0.2: build
hashable-1.2.6.1: build
scientific-0.3.5.1: build
unordered-containers-0.2.8.0: build
attoparsec-0.13.1.0: build
uuid-types-1.0.3: build
lucid-2.9.8.1: build
vector-th-unbox-0.2.1.6: build
math-functions-0.2.1.0: build
mwc-random-0.13.6.0: build
cassava-0.4.5.1: build
aeson-1.1.2.0: build
foldl-1.2.5: build

For example, I would like to understand whether free (which brings in a few dependencies of its own) is really necessary, or whether it can be removed in favour of a simpler (if ad hoc) solution.

datasets: fix benchmark dataset folder

When running stack bench I get

bench: /Users/ocramz/.cache/datasets-hs/cifar-10-imagefolder/Truck: getDirectoryContents:openDirStream: does not exist (No such file or directory)

I guess it's a matter of copying the test data into a temporary directory before running these tests.
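A minimal sketch, assuming the temporary package (withBenchData is a hypothetical helper; it copies a flat directory, so CIFAR-10's subfolders would need a recursive copy):

import System.Directory (copyFile, listDirectory)
import System.FilePath ((</>))
import System.IO.Temp (withSystemTempDirectory)

-- Copy a dataset folder into a scratch directory and run the benchmark
-- against the copy; the scratch directory is deleted afterwards.
withBenchData :: FilePath -> (FilePath -> IO a) -> IO a
withBenchData src run =
  withSystemTempDirectory "datasets-bench" $ \tmp -> do
    files <- listDirectory src
    mapM_ (\f -> copyFile (src </> f) (tmp </> f)) files
    run tmp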

add (lower) dependency bounds

Users who don't use stack might have a hard time building this project, so (lower) version bounds should be added to all contributed packages and to dh-core itself.
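For example, a build-depends stanza with explicit bounds might look like this (the versions shown are illustrative, not vetted against the actual code):

build-depends:   base       >= 4.10 && < 5
               , aeson      >= 1.1
               , vector     >= 0.12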

analyze: add usage example(s)

Possibly a binary in the app/ folder with an end-to-end workflow. Then we can fold anything good that comes out of this back into the main library.

Add dataloader for large datasets

I'm looking to write a data loader which gets image datasets from disk in the same way PyTorch's DataLoader class does, with the option to load images in batches. Originally this was going to go into hasktorch, but I think it might be better served in datasets -- what do you think? This could be done in isolation from most of datasets, but one small wrinkle is that the code could be written to also fetch datasets like CIFAR-10 or MNIST (which I would assume is ideal) -- in that case there might be some overlap with getFileFromSource, and some refactoring might be nice (like multithreaded downloads).

Does this sound like a good contribution?
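For concreteness, a minimal sketch of the batched-loading part (all names here are hypothetical; this reads eagerly, whereas a real loader would stream batches on demand):

import qualified Data.ByteString as BS
import System.Directory (listDirectory)
import System.FilePath ((</>))

-- A batch is just the raw bytes of a group of image files.
newtype Batch = Batch [BS.ByteString]

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)

-- Load all files under a directory in batches of the given size.
loadBatches :: Int -> FilePath -> IO [Batch]
loadBatches batchSize dir = do
  files <- listDirectory dir
  mapM (fmap Batch . mapM (BS.readFile . (dir </>)))
       (chunksOf batchSize files)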

datasets : harmonize Netflix parsers with the rest

The Netflix Prize dataset uses a custom parser, because one data example does not fit into a single dataset row (as CSV data would) but has a custom "stanza-based" format. For example, these are two stanzas of the "qualifying.txt" data file:

1:
1046323,2005-12-19
1080030,2005-12-23
2127527,2005-12-04
1944918,2005-10-05
1057066,2005-11-07
954049,2005-12-20
10:
12868,2004-10-19
627923,2005-12-16
690763,2005-12-13

It would be nice to upgrade the library so that it can deal with these cases.

Solution sketch:

  • Add one constructor to ReadAs that can accept an attoparsec parser as a parameter (see the sketch below)
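A sketch of what that could look like, assuming ReadAs is (or becomes) a GADT; the existing constructors are elided and the constructor name is illustrative:

{-# LANGUAGE GADTs #-}

import qualified Data.Attoparsec.ByteString as A

data ReadAs a where
  -- ... existing constructors (JSON, CSV, ...) ...
  Parsable :: A.Parser [a] -> ReadAs a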

Bump Stackage to latest Nightly

Currently we build against Stackage LTS 11.22, but some dependencies (e.g. req) have changed in a non-backward-compatible way.
Fix: upgrade to a Stackage nightly for now, until the next LTS comes out.

BostonHousing data set URL needs to be updated.

Describe the bug
The UCI ML Repository link http://mlr.cs.umass.edu/ml/datasets/housing is down, and requesting the BostonHousing dataset throws an exception:

*** Exception: VanillaHttpException (HttpExceptionRequest Request {
  host                 = "mlr.cs.umass.edu"
  port                 = 80
  secure               = False
  requestHeaders       = []
  path                 = "/ml/machine-learning-databases/housing/housing.data"
  queryString          = ""
  method               = "GET"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
  proxySecureMode      = ProxySecureWithConnect
}
 (ConnectionFailure Network.Socket.getAddrInfo (called with preferred socket type/protocol: AddrInfo {addrFlags = [], addrFamily = AF_UNSPEC, addrSocketType = Stream, addrProtocol = 0, addrAddress = 0.0.0.0:0, addrCanonName = Nothing}, host name: Just "mlr.cs.umass.edu", service name: Just "80"): does not exist (nodename nor servname provided, or not known)))

To Reproduce
Steps to reproduce the behavior:

  1. In GHCi, type import Numeric.Datasets (getDataset)
  2. Type import Numeric.Datasets.BostonHousing (bostonHousing)
  3. Type bh <- getDataset bostonHousing
  4. See error

Expected behavior
Loads the Boston Housing dataset into memory as the object bh.

Desktop (please complete the following information):

  • OS: macOS
  • GHC version 8.10.4

Additional context
The line below needs to be updated to use uciMLDB:

csvDataset $ URL $ umassMLDB /: "housing" /: "housing.data"
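Presumably the fix is just swapping the base-URL combinator (assuming uciMLDB points at the working archive.ics.uci.edu host, as in the corrections referenced below):

csvDataset $ URL $ uciMLDB /: "housing" /: "housing.data"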

Reference: some dataset URLs were corrected in #67

Add test coverage

Code coverage should be added to the Travis config (perhaps the cabal file and/or the stack options need to be changed to account for hpc coverage generation); currently the Travis config only contains a project key.
This tool uploads hpc coverage reports to codecov.io .

Cross validation layer

Looking over the Dataloader code, I immediately thought about integrating a private dataset to play with some Haskell code. This made me wonder if anyone has thought about adding a cross-validation layer on top of it. For some canonical datasets there are predefined splits (test, train, validation), but for others one would need to define these.

It would be nice if there were some code that could partition given data according to k-folds and leave-p-out. In the case of time-series datasets, you'd have to make sure that the partitions respect the temporal ordering.
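A minimal sketch of the k-fold case (a hypothetical helper, not an existing API; folds are contiguous, and when k doesn't divide the length evenly the leftover elements always stay in the training part):

-- Each fold serves once as validation, with the remainder as training.
kFolds :: Int -> [a] -> [([a], [a])]  -- (training, validation) pairs
kFolds k xs =
  [ (before ++ after, fold)
  | i <- [0 .. k - 1]
  , let (before, rest) = splitAt (i * len) xs
        (fold, after)  = splitAt len rest
  ]
  where
    len = length xs `div` k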

datasets : add exceptions

Currently, the parsers call error and fail here and there. Since these are synchronous exceptions, it would be better to use MonadThrow, which can conveniently be instantiated at a "pure" type such as Maybe or Either.

  1. add exceptions as dependency
  2. import Control.Monad.Catch (MonadThrow(..))
  3. declare some parsing exception types (which require Typeable and Exception instances, see https://www.fpcomplete.com/blog/2016/11/exceptions-best-practices-haskell)
  4. convert the calls to error and fail into calls to throwM
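A minimal sketch of steps 2-4 (the exception type and parseRow are illustrative names, not existing code):

import Control.Exception (Exception)
import Control.Monad.Catch (MonadThrow (..))
import Data.Typeable (Typeable)

data DatasetParseError = DatasetParseError String
  deriving (Show, Typeable)

instance Exception DatasetParseError

-- before: parseRow s = error ("no parse: " ++ s)
parseRow :: MonadThrow m => String -> m Int
parseRow s = case reads s of
  [(n, "")] -> pure n
  _         -> throwM (DatasetParseError ("no parse: " ++ s))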

Algorithms : Classification : Decision trees

I've started adding some code from my decision-trees project under the Core.Numeric.Statistics and Core.Data namespaces. There is some machinery that could be reused (for example the Dataset abstraction for labeled data, and some information-theoretic functionals).

See 6bba752

Implement DoubleDouble

It's a way of emulating a not-quite-quad-precision number using two Doubles. The algorithm is interesting in itself and could have a few uses, but I think its main value is providing an example of a constant-size approximation of the real numbers which isn't IEEE 754. It would be very useful for implementing the type classes for working with low-level representations of numbers; without such examples it's all too easy to assume that only single- and double-precision IEEE 754 numbers exist.

A Julia implementation and references can be found here: https://github.com/JuliaMath/DoubleDouble.jl
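A minimal sketch of the representation, together with the error-free "two-sum" building block (Knuth) on which double-double arithmetic rests; the names are illustrative, and addDD is the simple variant rather than a fully accurate one:

-- A number stored as the unevaluated sum of two Doubles, with |ddLo| << |ddHi|.
data DoubleDouble = DD { ddHi :: !Double, ddLo :: !Double }
  deriving Show

-- Knuth's two-sum: returns (s, e) with s = fl(a + b) and a + b = s + e exactly.
twoSum :: Double -> Double -> (Double, Double)
twoSum a b = (s, e)
  where
    s  = a + b
    b' = s - a
    e  = (a - (s - b')) + (b - b')

addDD :: DoubleDouble -> DoubleDouble -> DoubleDouble
addDD (DD a ae) (DD b be) = DD s' e'
  where
    (s, e)   = twoSum a b
    t        = e + (ae + be)
    (s', e') = twoSum s t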

analyze: test failure

    [14 of 14] Compiling Main             ( test/Spec.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/spec/spec-tmp/Main.o )
    
   dh-core/analyze/test/Spec.hs:30:14: error:
        • Ambiguous type variable ‘e0’ arising from a use of ‘catch’
          prevents the constraint ‘(Exception e0)’ from being solved.
          Probable fix: use a type annotation to specify what ‘e0’ should be.
          These potential instances exist:
            instance Exception SomeException -- Defined in ‘GHC.Exception’
            instance Exception A.ColSizeMismatch
              -- Defined at src/Analyze/Common.hs:37:10
            instance (Show k,
                      base-4.11.1.0:Data.Typeable.Internal.Typeable k) =>
                     Exception (A.DuplicateKeyError k)
              -- Defined at src/Analyze/Common.hs:33:10
            ...plus five others
            ...plus 17 instances involving out-of-scope types
            (use -fprint-potential-instances to see them all)
        • In the expression: catch (action >> return P.succeeded) handler
          In an equation for ‘tester’:
              tester = catch (action >> return P.succeeded) handler
          In an equation for ‘propertyIO’:
              propertyIO action
                = ioProperty tester
                where
                    tester :: IO P.Result
                    tester = catch (action >> return P.succeeded) handler
                    handler (HUnitFailure err) = return P.failed {P.reason = err}
       |
    30 |     tester = catch (action >> return P.succeeded) handler
       |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    dh-core/analyze/test/Spec.hs:31:14: error:
        • The constructor ‘HUnitFailure’ should have 2 arguments, but has been given 1
        • In the pattern: HUnitFailure err
          In an equation for ‘handler’:
              handler (HUnitFailure err) = return P.failed {P.reason = err}
          In an equation for ‘propertyIO’:
              propertyIO action
                = ioProperty tester
                where
                    tester :: IO P.Result
                    tester = catch (action >> return P.succeeded) handler
                    handler (HUnitFailure err) = return P.failed {P.reason = err}
       |
    31 |     handler (HUnitFailure err) = return P.failed { P.reason = err }
       |              ^^^^^^^^^^^^^^^^
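A sketch of a possible fix for both errors, assuming HUnit >= 1.5 (where HUnitFailure carries a Maybe SrcLoc and a structured FailureReason, and Test.HUnit.Lang exports formatFailureReason); the explicit type on handler also resolves the ambiguous e0 in the first error:

import Control.Exception (catch)
import Test.HUnit.Lang (HUnitFailure (..), formatFailureReason)
import Test.QuickCheck (Property, ioProperty)
import qualified Test.QuickCheck.Property as P

propertyIO :: IO () -> Property
propertyIO action = ioProperty tester
  where
    tester :: IO P.Result
    tester = catch (action >> return P.succeeded) handler

    handler :: HUnitFailure -> IO P.Result
    handler (HUnitFailure _ reason) =
      return P.failed { P.reason = formatFailureReason reason }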

datasets : lint with brittany

datasets is very inconsistent, with lots of extra whitespace, which causes terrible diffs. I think it, as well as dh-core, needs linting rules for consistency when developing -- and possibly GitHub hooks to reject PRs that don't adhere.

brittany is the current standard for haskell-ide-engine, is very flexible, and has a style I'm familiar with -- so that would be my vote. If anyone has alternatives, I think they should mention them here.

Linting the codebase basically requires everyone to sync up on branches. Luckily there are only four forks; we should try to coordinate here to minimize the number of rebases that will be required after a linting commit.

analyze : evaluate `streaming` for RFrame

The RFrame type currently stores the frame entries as a Vector of Vectors (each inner vector being a data row). It would be nice to compare the performance of this storage scheme with that of a streaming library (e.g. Stream (Of (Vector v)) m ()).
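For concreteness, the two representations to benchmark might look like this (simplified; the actual RFrame also tracks column keys):

import Streaming (Stream, Of)
import qualified Data.Vector as V

-- Current approach (simplified): all rows resident in memory.
newtype RFrameInMem v = RFrameInMem (V.Vector (V.Vector v))

-- Streaming alternative: rows produced on demand in some monad m.
type RFrameStream m v = Stream (Of (V.Vector v)) m ()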

dense-linear-algebra: Getting stream fusion to work across `Matrix`'s

The current problem is as follows:

(U.sum . flip M.column 0) a does not fuse. It seems to boil down to:

testRewrite1 :: Matrix -> Double  -- fuses
testRewrite1 (Matrix r c v) = U.sum . flip (\u j -> U.generate r (\i -> u `U.unsafeIndex` (j + i * c))) 0 $ v

testRewrite2 :: Matrix -> Double -- does NOT fuse
testRewrite2 m = U.sum . flip (\(Matrix r c v) j -> U.generate r (\i -> v `U.unsafeIndex` (j + i * c))) 0 $ m

note: the flip isn't important, it's just there for convenience, since this comes from https://github.com/Magalame/fastest-matrices

So what seems to happen is that stream fusion cannot "go through" the Matrix constructor; I'm not sure exactly why.

datasets: add unit tests

Some unit tests asserting e.g. the length or some other property of the datasets would be nice to have.
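For instance (a sketch assuming hspec as the test framework; the iris dataset is documented to contain 150 examples):

import Test.Hspec
import Numeric.Datasets (getDataset)
import Numeric.Datasets.Iris (iris)

main :: IO ()
main = hspec $
  describe "Numeric.Datasets.Iris" $
    it "contains 150 examples" $ do
      rows <- getDataset iris
      length rows `shouldBe` 150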

dense-linear-algebra : Add chronos-bench benchmarks

Is your feature request related to a problem? Please describe.
We cannot have criterion benchmarks for dense-linear-algebra, since there is this dependency chain:

criterion -> statistics -> dense-linear-algebra

chronos-bench doesn't depend on dense-linear-algebra ^_^

https://hackage.haskell.org/package/chronos-bench

Describe the solution you'd like
Have some performance benchmarks

Describe alternatives you've considered
There might be alternative benchmarking packages that don't depend (transitively) on dense-linear-algebra.

Cut new release

  • test dh-core as a whole with latest changes
  • releases:
    • datasets
    • ?

datasets : add ARFF format

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes.

https://www.cs.waikato.ac.nz/ml/weka/arff.html

Overview

ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types. An example header on the standard IRIS dataset looks like this:

   % 1. Title: Iris Plants Database
   % 
   % 2. Sources:
   %      (a) Creator: R.A. Fisher
   %      (b) Donor: Michael Marshall (MARSHALL%[email protected])
   %      (c) Date: July, 1988
   % 
   @RELATION iris

   @ATTRIBUTE sepallength  NUMERIC
   @ATTRIBUTE sepalwidth   NUMERIC
   @ATTRIBUTE petallength  NUMERIC
   @ATTRIBUTE petalwidth   NUMERIC
   @ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data of the ARFF file looks like the following:

   @DATA
   5.1,3.5,1.4,0.2,Iris-setosa
   4.9,3.0,1.4,0.2,Iris-setosa
   4.7,3.2,1.3,0.2,Iris-setosa
   4.6,3.1,1.5,0.2,Iris-setosa
   5.0,3.6,1.4,0.2,Iris-setosa
   5.4,3.9,1.7,0.4,Iris-setosa
   4.6,3.4,1.4,0.3,Iris-setosa
   5.0,3.4,1.5,0.2,Iris-setosa
   4.4,2.9,1.4,0.2,Iris-setosa
   4.9,3.1,1.5,0.1,Iris-setosa

Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.
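A minimal sketch of an attoparsec parser for the header's @ATTRIBUTE lines (the types and names are illustrative, not an existing datasets API):

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import Data.Text (Text)

data AttrType  = Numeric | Nominal [Text] deriving Show
data Attribute = Attribute Text AttrType  deriving Show

-- Parses e.g.: @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
attribute :: Parser Attribute
attribute = do
  _    <- asciiCI "@attribute" <* skipSpace   -- declarations are case insensitive
  name <- takeTill isHorizontalSpace <* skipSpace
  ty   <- (Numeric <$  asciiCI "numeric")
      <|> (Nominal <$> braces (takeTill (inClass ",}") `sepBy` char ','))
  pure (Attribute name ty)
  where
    braces p = char '{' *> p <* char '}'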

datasets : split off datasets-core

Medium-to-long term: the loading/parsing machinery is growing in size and scope (see #22, #29), so those functions and types could be gathered into a separate datasets-core package. datasets will import it and add the actual datasets. Any ideas?

Cannot build project on macOS 10.14.3

The project fails to build on macOS 10.14.3 (Mojave).
The compiler complains that it cannot find headers and binaries for the zlib and curl libraries.

To Reproduce
Steps to reproduce the behavior:

  1. Run

git clone git@github.com:DataHaskell/dh-core.git
cd dh-core
git checkout a2ad2552e8525acf0ace12069d29f333d1793f05
cd dh-core
stack build --no-nix

  2. See the error in the log. Cannot reproduce it right now after calling stack clean, probably because the built library is cached somewhere. Will try to reproduce the error with a Travis build.

Workarounds

  1. Use libraries from Homebrew:

brew install curl zlib
stack build \
    --extra-include-dirs=/usr/local/opt/curl/include --extra-lib-dirs=/usr/local/opt/curl/lib \
    --extra-include-dirs=/usr/local/opt/zlib/include --extra-lib-dirs=/usr/local/opt/zlib/lib

  2. Use Nix to set up a proper build environment. Add the following lines to dh-core/stack.yaml:

nix:
  enable: true
  packages:
  - curl
  - zlib

and run stack build.

Environment

  • OS: macOS 10.14.3 (Mojave)
  • Stack Version 1.9.3 x86_64 hpack-0.31.1

dense-linear-algebra : Weird memory and runtime behavior from `generateSym`

The generateSym function is defined as:

generateSym :: Int -> (Int -> Int -> Double) -> Matrix
generateSym n f = runST $ do
  m <- unsafeNew n n
  for 0 n $ \r -> do
    unsafeWrite m r r (f r r)   -- diagonal element
    for (r+1) n $ \c -> do
      let x = f r c
      unsafeWrite m r c x       -- upper triangle
      unsafeWrite m c r x       -- mirrored into the lower triangle
  unsafeFreeze m

Running it with n = 100, I noted that the function allocates ~160,000 bytes of memory, which is around twice what we would expect when allocating one Matrix (a 100×100 matrix of unboxed Doubles occupies 100 × 100 × 8 = 80,000 bytes).
This extra allocation seems to be related to the dependence of x on c. If we change f r c to f r r, the allocation drops to ~80,000 bytes and the runtime is halved.

dense-linear-algebra: Add support for SIMD instructions

SIMD instructions seem to be of great importance to the performance of a linear algebra library. The big question then is how to incorporate them into the rest of the library.

I've had some success with a fork of simd: https://github.com/Magalame/simd

Another question to solve would be how to:

BLAS layer

Unify dense and sparse linear algebra, for a given underlying vector type, under one and the same interface.
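A hypothetical sketch of what such a unified interface could look like (the class name and methods are illustrative only):

{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}

-- One interface over dense and sparse matrix representations,
-- indexed by the underlying vector type they operate on.
class LinOp mat vec | mat -> vec where
  (#>)   :: mat -> vec -> vec   -- matrix-vector product
  transp :: mat -> mat
  scale  :: Double -> mat -> mat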

Blocked by #1 and #3
