datahaskell / dh-core: Functional data science
In #14 I raised the point that we need a set of type classes for properties of numbers, and another for approximate comparison.
So what I think we need is a type class for things like machine epsilon, the maximal and minimal representable numbers, transfinite numbers, NaN handling, etc., and another type class for approximate comparison of values. The design space here is rather large, so it would be good to collect the current state of the art and implementations in different languages and libraries.
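As a concrete starting point for discussion, here is a minimal sketch of what the two type classes could look like; all names (`FloatingProps`, `ApproxEq`, and their methods) are placeholders for the sake of illustration, not a proposed final API:

```haskell
-- Properties of a floating-point representation.
class RealFloat a => FloatingProps a where
  machineEpsilon   :: a   -- smallest e such that 1 + e /= 1
  maxRepresentable :: a   -- largest finite value
  minPositive      :: a   -- smallest positive normalized value

instance FloatingProps Double where
  machineEpsilon   = 2.220446049250313e-16
  maxRepresentable = 1.7976931348623157e308
  minPositive      = 2.2250738585072014e-308

-- Approximate comparison, parameterized by a relative tolerance.
class ApproxEq a where
  approxEq :: a -> a -> a -> Bool   -- approxEq tol x y

instance ApproxEq Double where
  approxEq tol x y = abs (x - y) <= tol * max (abs x) (abs y)
```

A design question this immediately raises is whether the tolerance should be an explicit argument (as here) or baked into a newtype wrapper.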
The set of transitive dependencies of analyze is currently quite large:
base-compat-0.9.3: build
base-orphans-0.6: build
dlist-0.8.0.3: build
cabal-doctest-1.0.2: build
integer-logarithms-1.0.2: build
mtl-2.2.1: build
primitive-0.6.2.0: build
random-1.1: build
semigroups-0.18.3: build
stm-2.4.4.1: build
text-1.2.2.2: build
time-locale-compat-0.1.1.3: build
StateVar-1.1.0.4: build
transformers-compat-0.5.1.4: build
vector-0.12.0.1: build
void-0.7.2: build
exceptions-0.8.3: build
contravariant-1.4: build
mmorph-1.0.9: build
tagged-0.8.5: build
distributive-0.5.2: build
comonad-5.0.1: build
bifunctors-5.4.2: build
profunctors-5.2: build
semigroupoids-5.2: build
free-4.12.4: build
blaze-builder-0.4.0.2: build
hashable-1.2.6.1: build
scientific-0.3.5.1: build
unordered-containers-0.2.8.0: build
attoparsec-0.13.1.0: build
uuid-types-1.0.3: build
lucid-2.9.8.1: build
vector-th-unbox-0.2.1.6: build
math-functions-0.2.1.0: build
mwc-random-0.13.6.0: build
cassava-0.4.5.1: build
aeson-1.1.2.0: build
foldl-1.2.5: build
For example, I would like to understand whether free (which brings in a few dependencies) is really necessary, or whether it can be removed in favour of a simpler (if ad-hoc) solution.
When running stack bench I get:
bench: /Users/ocramz/.cache/datasets-hs/cifar-10-imagefolder/Truck: getDirectoryContents:openDirStream: does not exist (No such file or directory)
I guess it's a matter of copying the test data into a temporary directory before running these tests.
As separate sub-project
Users who don't use stack
might have a hard time building this project, so (lower) version bounds should be added to all contributed packages and to dh-core itself.
Possibly a binary in the app/ folder with an end-to-end workflow. Then we can split anything good that comes out of this back into the main library.
I'm looking to write a data loader which gets image datasets from disk in the same way PyTorch's DataLoader class does. It should have the option to load images in batches. Originally this was going to go into hasktorch, but I think it might be better served in datasets -- what do you think? This could be done in isolation from most of datasets, but one small wrinkle is that the code could be written to also fetch datasets like CIFAR-10 or MNIST (which I would assume is ideal) -- in that case there might be some overlap with getFileFromSource, and some refactoring might be nice (like multithreaded downloads).
Does this sound like a good contribution?
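For concreteness, here is a minimal sketch of the directory-walking and batching part, assuming a PyTorch-style ImageFolder layout (root/&lt;class-name&gt;/&lt;image files&gt;); `loadImageFolder`, `Batch`, and `batchesOf` are made-up names, and actual image decoding is left out:

```haskell
import System.Directory (listDirectory)
import System.FilePath ((</>))
import Data.List (sort)

-- Stand-in for a batch of decoded images; a real loader would
-- decode these paths into tensors.
type Batch = [FilePath]

-- Group a list into batches of size n (assumes n >= 1).
batchesOf :: Int -> [a] -> [[a]]
batchesOf _ [] = []
batchesOf n xs = let (b, rest) = splitAt n xs in b : batchesOf n rest

-- Enumerate (class, batch-of-file-paths) pairs from an
-- ImageFolder-style layout: root/<class-name>/<image files>.
loadImageFolder :: FilePath -> Int -> IO [(String, Batch)]
loadImageFolder root batchSize = do
  classes <- sort <$> listDirectory root
  fmap concat . mapM perClass $ classes
  where
    perClass cls = do
      files <- sort <$> listDirectory (root </> cls)
      pure [ (cls, map ((root </> cls) </>) b)
           | b <- batchesOf batchSize files ]
```

Shuffling and lazy/streamed loading would be the natural next steps, and are where the overlap with getFileFromSource would show up.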
The Netflix dataset seems to still be available in the public domain via Kaggle:
https://www.kaggle.com/netflix-inc/netflix-prize-data
contrary to the comment in the corresponding data loader:
The Netflix Prize dataset uses a custom parser because one data example does not fit into a single dataset row (such as a CSV row) but has a custom "stanza-based" format. For example, these are two stanzas of the "qualifying.txt" data file:
1:
1046323,2005-12-19
1080030,2005-12-23
2127527,2005-12-04
1944918,2005-10-05
1057066,2005-11-07
954049,2005-12-20
10:
12868,2004-10-19
627923,2005-12-16
690763,2005-12-13
It would be nice to upgrade the library so that it can deal with such cases.
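As a sketch of what such a stanza parser could look like (the names `MovieId`, `Entry`, and `parseStanzas` are illustrative, not the library's actual types, and the parser assumes well-formed "customerId,date" lines):

```haskell
import Data.List (isSuffixOf)

type MovieId = Int
data Entry = Entry { customerId :: Int, rentalDate :: String }
  deriving Show

-- Parse the "qualifying.txt"-style layout: a "<movieId>:" header
-- line followed by "customerId,date" lines until the next header.
parseStanzas :: String -> [(MovieId, [Entry])]
parseStanzas = go . lines
  where
    go [] = []
    go (l : ls)
      | ":" `isSuffixOf` l =
          let movieId      = read (init l)
              (body, rest) = break (":" `isSuffixOf`) ls
          in (movieId, map parseEntry body) : go rest
      | otherwise = go ls   -- skip lines before the first header
    -- Assumes each body line contains exactly one comma.
    parseEntry l =
      let (cust, _ : date) = break (== ',') l
      in Entry (read cust) date
```

A production version would of course report malformed lines instead of skipping or crashing on them, which ties in with the MonadThrow discussion elsewhere in this tracker.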
The datasets downloader could use the above improvements: verifying downloads with hashes, and multithreading large downloads. I've written a version of the first feature in the Setup.hs for a personal moby/dictd replacement, so it might look something like the below:
https://gist.github.com/stites/82acb2036d1654b0ef0c34ec4443579b
Reconstruction of the full original MNIST image set.
Currently we build against Stackage LTS 11.22, but some dependencies (e.g. req) changed in a non-backward-compatible way.
Fix: upgrade to Stackage nightly for now, until the next LTS comes out.
Describe the bug
The UCI ML Repository link http://mlr.cs.umass.edu/ml/datasets/housing is down, and requesting the BostonHousing dataset throws an exception:
*** Exception: VanillaHttpException (HttpExceptionRequest Request {
host = "mlr.cs.umass.edu"
port = 80
secure = False
requestHeaders = []
path = "/ml/machine-learning-databases/housing/housing.data"
queryString = ""
method = "GET"
proxy = Nothing
rawBody = False
redirectCount = 10
responseTimeout = ResponseTimeoutDefault
requestVersion = HTTP/1.1
proxySecureMode = ProxySecureWithConnect
}
(ConnectionFailure Network.Socket.getAddrInfo (called with preferred socket type/protocol: AddrInfo {addrFlags = [], addrFamily = AF_UNSPEC, addrSocketType = Stream, addrProtocol = 0, addrAddress = 0.0.0.0:0, addrCanonName = Nothing}, host name: Just "mlr.cs.umass.edu", service name: Just "80"): does not exist (nodename nor servname provided, or not known)))
To Reproduce
Steps to reproduce the behavior:
import Numeric.Datasets (getDataset)
import Numeric.Datasets.BostonHousing (bostonHousing)
bh <- getDataset bostonHousing
Expected behavior
Loads the Boston Housing dataset into memory as the object bh.
Additional context
The line below needs to be updated to use uciMLDB.
Reference: some dataset URLs were corrected in #67
cassava / sv
hnetcdf (https://github.com/ian-ross/hnetcdf)
hedis
esqueleto
beam
Code coverage should be added to the Travis config (perhaps the cabal file and/or the stack options need to be changed in order to account for hpc coverage generation); currently the Travis config only contains a project key.
This tool uploads hpc coverage reports to codecov.io.
Looking over the Dataloader code, I immediately thought about integrating a private dataset to play with some Haskell code. This made me wonder if anyone has thought about adding a cross-validation layer on top of it. For some canonical datasets there are predefined splits (test, train, validation), but for others one would need to define these.
It would be nice if there were some code that could partition some given data according to k-folds and leave-p-out. In the case of time-series datasets, you'd have to make sure that the partitions respect the temporal ordering.
Currently, the parsers error and fail here and there. Since these are synchronous exceptions, it would be better to use MonadThrow, which can conveniently be used at a "pure" type such as Maybe or Either. Concretely:
- add exceptions as a dependency
- turn calls to error and fail into calls to throwM
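A sketch of the proposed change, assuming the exceptions package; `ParseError` and `parseDouble` are illustrative names, not existing code:

```haskell
import Control.Monad.Catch (MonadThrow, throwM, Exception)
import Text.Read (readMaybe)

newtype ParseError = ParseError String
  deriving Show
instance Exception ParseError

-- Instead of calling `error`/`fail` on bad input, throw via
-- MonadThrow, so the caller picks the result type: Maybe,
-- Either SomeException, IO, ...
parseDouble :: MonadThrow m => String -> m Double
parseDouble s = case readMaybe s of
  Just x  -> pure x
  Nothing -> throwM (ParseError ("not a Double: " ++ s))
```

At `m ~ Maybe` the throw becomes Nothing, and at `m ~ IO` it becomes a regular synchronous exception, so existing IO callers keep working unchanged.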
I've started adding some code from my decision-trees project under the Core.Numeric.Statistics and Core.Data namespaces. There is some machinery that could be re-used (for example the Dataset abstraction for labeled data and some information-theory functionals).
See 6bba752
Some references:
It's a way of emulating a not-quite-quad-precision number using two doubles. The algorithm is interesting by itself and could have a few uses, but I think its main value is providing an example of a constant-size approximation of real numbers which isn't IEEE 754. It would be very useful for implementing type classes for working with low-level representations of numbers; without such examples it's all too easy to assume that only single- and double-precision IEEE 754 numbers exist.
A Julia implementation and references can be found here: https://github.com/JuliaMath/DoubleDouble.jl
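To make the idea concrete, here is the core building block, Knuth's exact two-sum, plus a minimal double-double type; `DD` and `addDD` are illustrative names, and a full implementation would need renormalization and the remaining arithmetic operations:

```haskell
-- A double-double: an unevaluated sum hi + lo with |lo| much
-- smaller than |hi|, giving roughly 2x the precision of Double.
data DD = DD { ddHi :: !Double, ddLo :: !Double }
  deriving Show

-- Knuth's error-free transformation: s + e == a + b exactly,
-- where s = fl(a + b) and e is the rounding error.
twoSum :: Double -> Double -> (Double, Double)
twoSum a b = (s, e)
  where
    s  = a + b
    b' = s - a
    a' = s - b'
    e  = (a - a') + (b - b')

-- Add a Double to a double-double, accumulating the error term.
addDD :: DD -> Double -> DD
addDD (DD h l) x =
  let (s, e) = twoSum h x
  in DD s (l + e)
```

The point for the type-class discussion is that DD has no machine epsilon in the IEEE 754 sense: its effective precision depends on the magnitude of the value, which is exactly the kind of case a representation-properties class has to accommodate.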
[14 of 14] Compiling Main ( test/Spec.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/spec/spec-tmp/Main.o )
dh-core/analyze/test/Spec.hs:30:14: error:
• Ambiguous type variable ‘e0’ arising from a use of ‘catch’
prevents the constraint ‘(Exception e0)’ from being solved.
Probable fix: use a type annotation to specify what ‘e0’ should be.
These potential instances exist:
instance Exception SomeException -- Defined in ‘GHC.Exception’
instance Exception A.ColSizeMismatch
-- Defined at src/Analyze/Common.hs:37:10
instance (Show k,
base-4.11.1.0:Data.Typeable.Internal.Typeable k) =>
Exception (A.DuplicateKeyError k)
-- Defined at src/Analyze/Common.hs:33:10
...plus five others
...plus 17 instances involving out-of-scope types
(use -fprint-potential-instances to see them all)
• In the expression: catch (action >> return P.succeeded) handler
In an equation for ‘tester’:
tester = catch (action >> return P.succeeded) handler
In an equation for ‘propertyIO’:
propertyIO action
= ioProperty tester
where
tester :: IO P.Result
tester = catch (action >> return P.succeeded) handler
handler (HUnitFailure err) = return P.failed {P.reason = err}
|
30 | tester = catch (action >> return P.succeeded) handler
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dh-core/analyze/test/Spec.hs:31:14: error:
• The constructor ‘HUnitFailure’ should have 2 arguments, but has been given 1
• In the pattern: HUnitFailure err
In an equation for ‘handler’:
handler (HUnitFailure err) = return P.failed {P.reason = err}
In an equation for ‘propertyIO’:
propertyIO action
= ioProperty tester
where
tester :: IO P.Result
tester = catch (action >> return P.succeeded) handler
handler (HUnitFailure err) = return P.failed {P.reason = err}
|
31 | handler (HUnitFailure err) = return P.failed { P.reason = err }
| ^^^^^^^^^^^^^^^^
The latest req doesn't use it (https://hackage.haskell.org/package/req), so we don't need it either.
datasets is very inconsistent, with lots of extra whitespace which causes terrible diffs. I think it, as well as dh-core, needs linting rules for consistency when developing -- also possibly GitHub hooks to reject PRs that don't adhere.
brittany is the current standard for haskell-ide-engine, is very flexible, and has a style I'm familiar with -- so that would be my vote. If anyone has alternatives, I think they should mention them here.
Linting the codebase basically requires everyone to sync up on branches. Luckily there are only four forks; we should try to sync up here to minimize the number of rebases that will be required after a linting commit.
datasets -> wreq -> lens-aeson -> lens
The RFrame type currently stores the frame entries as a Vector of Vectors (each inner vector being a data row). It would be nice to compare the performance of this way of storing with that of a streaming library (e.g. Stream (Of (Vector v)) m ()).
The current problem is as follows: (U.sum . flip M.column 0) a does not fuse. It seems to boil down to:
testRewrite1 :: Matrix -> Double -- fuses
testRewrite1 (Matrix r c v) = U.sum . flip (\u j -> U.generate r (\i -> u `U.unsafeIndex` (j + i * c))) 0 $ v

testRewrite2 :: Matrix -> Double -- does NOT fuse
testRewrite2 m = U.sum . flip (\(Matrix r c v) j -> U.generate r (\i -> v `U.unsafeIndex` (j + i * c))) 0 $ m
Note: the flip isn't important, it's just there for convenience, since this is from https://github.com/Magalame/fastest-matrices
So what seems to happen is that stream fusion cannot "go through" the Matrix constructor; I'm not sure exactly why.
As sub-project
Better to move parts of the backend and the typeclass interface for now
Some unit tests asserting e.g. the length or some other property of the datasets would be nice to have.
Is your feature request related to a problem? Please describe.
We cannot have criterion benchmarks for dense-linear-algebra, since there is this dependency chain:
criterion -> statistics -> dense-linear-algebra
chronos-bench (https://hackage.haskell.org/package/chronos-bench) doesn't depend on dense-linear-algebra ^_^
Describe the solution you'd like
Have some performance benchmarks
Describe alternatives you've considered
There might be alternative benchmarking packages not based on dense-linear-algebra
dh-core as a whole with latest changes

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes.
https://www.cs.waikato.ac.nz/ml/weka/arff.html
Overview
ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.
The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types. An example header on the standard IRIS dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%[email protected])
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are case-insensitive.
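A sketch of how the @ATTRIBUTE declarations shown above could be parsed; `Attribute` and `parseAttribute` are illustrative names, and this only covers the NUMERIC and nominal cases from the example (a full parser would also handle STRING, DATE, quoted names, etc.):

```haskell
import Data.Char (toLower)

data Attribute
  = Numeric String            -- @ATTRIBUTE name NUMERIC
  | Nominal String [String]   -- @ATTRIBUTE name {a,b,c}
  deriving (Show, Eq)

parseAttribute :: String -> Maybe Attribute
parseAttribute line = case words line of
  (kw : name : rest)
    | map toLower kw == "@attribute" -> parseType name (unwords rest)
  _ -> Nothing
  where
    -- The keyword and type are case-insensitive per the ARFF spec.
    parseType name t
      | map toLower t == "numeric" = Just (Numeric name)
      | ('{' : s) <- t =
          Just (Nominal name (splitOn ',' (takeWhile (/= '}') s)))
      | otherwise = Nothing
    splitOn c s = case break (== c) s of
      (a, [])       -> [a]
      (a, _ : rest) -> a : splitOn c rest
```

Since the @DATA section is plain comma-separated values, the existing cassava machinery could probably be reused for it once the header has been consumed.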
I guess all packages should build with one consistent library set. Correct me if I am wrong @ocramz .
The project fails to build on macOS 10.14.3 (Mojave). The compiler complains that it cannot find headers and binaries for the zlib and curl libraries.
To Reproduce
Steps to reproduce the behavior:
git clone [email protected]:DataHaskell/dh-core.git
cd dh-core
git checkout a2ad2552e8525acf0ace12069d29f333d1793f05
cd dh-core
stack build --no-nix
stack clean, probably because the built library is cached somewhere. Will try to reproduce the error with a Travis build.

Workarounds
brew install curl zlib
stack build \
--extra-include-dirs=/usr/local/opt/curl/include --extra-lib-dirs=/usr/local/opt/curl/lib \
--extra-include-dirs=/usr/local/opt/zlib/include --extra-lib-dirs=/usr/local/opt/zlib/lib
Alternatively, add to dh-core/stack.yaml:
nix:
  enable: true
  packages:
    - curl
    - zlib
and run stack build
Environment
The generateSym function is defined as:
generateSym :: Int -> (Int -> Int -> Double) -> Matrix
generateSym n f = runST $ do
m <- unsafeNew n n
for 0 n $ \r -> do
unsafeWrite m r r (f r r)
for (r+1) n $ \c -> do
let x = f r c
unsafeWrite m r c x
unsafeWrite m c r x
unsafeFreeze m
Running it with n=100, we can note that the function allocates ~160,000 bytes of memory, which is around twice what we would expect when allocating one Matrix.
This allocation seems to be related to the dependence of x on c. If we change f r c to f r r, the allocation drops by 80,000 bytes, and the runtime is halved.
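For reference, the per-call allocation figure can be measured from within Haskell using base's thread-local allocation counter (it counts down as the thread allocates, with per-block granularity, so the result is approximate); `allocatedBy` is a made-up helper name:

```haskell
import GHC.Conc (getAllocationCounter, setAllocationCounter)
import Control.Exception (evaluate)
import Data.Int (Int64)

-- Run an action and return roughly how many bytes it allocated
-- on the current thread.
allocatedBy :: IO a -> IO Int64
allocatedBy act = do
  setAllocationCounter maxBound  -- reset the down-counting counter
  _ <- act
  counter <- getAllocationCounter
  pure (maxBound - counter)
```

Something like `allocatedBy (evaluate (generateSym 100 f))` should then report a number in the vicinity of the ~160 kB quoted above, making it easy to check candidate fixes.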
Test fixtures (e.g. analyze/test/Fixtures.hs) could be gradually replaced by test properties, based on quickcheck, hedgehog or genvalidity.
As sub-projects since they are already published
As sub-project
SIMD instructions seem to be of great importance for the performance of a linear algebra library. The big question then is how to incorporate them into the rest of the library.
I've had some success with a fork of simd
: https://github.com/Magalame/simd
Another question to solve would be how to:
I'm working on a hasktorch example with fashion-mnist, and @stites suggested adding the dataset to datasets, which I think is pretty useful!
Referring to this issue over at hasktorch: hasktorch/hasktorch#102