Comments (17)
@Magalame yep!
from dh-core.
I think I'd like to try that one too.
If I understand properly, the idea is to replace every unit test with a property test? So in short, writing the Arbitrary instances first, and then the actual test part?
from dh-core.
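For readers following along, here is a minimal sketch of the two pieces mentioned above: an Arbitrary instance, then the property itself. The Update type is a hypothetical stand-in, not the RFrameUpdate type from analyze, and the sketch assumes the QuickCheck package is installed.

```haskell
import Test.QuickCheck

-- Hypothetical stand-in for a frame update: column keys plus rows of values.
data Update = Update { updKeys :: [String], updRows :: [[Double]] }
  deriving (Eq, Show)

-- Generate rectangular data: every row holds exactly one value per key.
instance Arbitrary Update where
  arbitrary = do
    keys  <- listOf arbitrary
    nRows <- choose (0, 20)
    rows  <- vectorOf nRows (vectorOf (length keys) arbitrary)
    pure (Update keys rows)

-- Property: each row is as wide as the list of keys.
prop_rowsMatchKeys :: Update -> Bool
prop_rowsMatchKeys (Update ks rs) = all ((== length ks) . length) rs

main :: IO ()
main = quickCheck prop_rowsMatchKeys
```

The generator does the real work here: once it only produces well-formed inputs, the property can stay a one-liner.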
@ocramz So I started writing the tests (it's a bit messy, I'll organise it more cleanly later): https://github.com/Magalame/dh-core/blob/replace-fixtures/analyze/test/Spec2.hs
It's my first time really experimenting with QuickCheck and the like, so I was wondering if it looks good so far?
from dh-core.
Almost completely set up; only the oneHot part is missing. I think I might be doing something wrong, because some tests seem to perform poorly, with a 15x slowdown.
from dh-core.
@Magalame um, that sounds strange. Which tests run so slowly?
from dh-core.
Here is what I get when running them:
Test suite
  Fixture:        OK (0.10s)
    +++ OK, passed 100 tests.
  Row Decode:     OK (0.15s)
    +++ OK, passed 100 tests.
  Drop:           OK (0.11s)
    +++ OK, passed 100 tests.
  Keep:           OK (0.10s)
    +++ OK, passed 100 tests.
  Update Empty:   OK (1.83s)   # all of these, from here down, are the slow ones
    +++ OK, passed 100 tests.
  Update Empty 2: OK (1.67s)
    +++ OK, passed 100 tests.
  Update Add:     OK (1.72s)
    +++ OK, passed 100 tests.
  Update Overlap: OK (1.79s)
    +++ OK, passed 100 tests.
  Take Rows:      OK (0.95s)
    +++ OK, passed 100 tests.
  Add Column:     OK (2.01s)
    +++ OK, passed 100 tests.
All 10 tests passed (10.43s)
I tried to profile it; here's the .prof: https://ufile.io/1lqia . However, it looks a bit cryptic to me.
from dh-core.
I think I'm having trouble dealing with oneHot. Here is an example of what I obtain with stack runghc Spec2.hs:
-----------------------
Original Update: RFrameUpdate {_rframeUpdateKeys = ["","zlh"], _rframeUpdateData = [[ValueDouble (-2.0136650956085296),ValueText "rs"],[ValueDouble 0.19461704868628563,ValueText "mx"]]}
-----
Original data: [[ValueDouble (-2.0136650956085296),ValueText "rs"],[ValueDouble 0.19461704868628563,ValueText "mx"]]
-----
Key to test: ""
-----
True/false value: ValueDouble 0.19461704868628563/ValueDouble (-2.0136650956085296)
-----
Expected result: RFrame {_rframeKeys = ["ValueDouble (-2.0136650956085296)","ValueDouble 0.19461704868628563"], _rframeLookup = fromList [("ValueDouble 0.19461704868628563",1),("ValueDouble (-2.0136650956085296)",0)], _rframeData = [[ValueDouble 0.19461704868628563,ValueDouble (-2.0136650956085296)],[ValueDouble (-2.0136650956085296),ValueDouble 0.19461704868628563]]}
-----
From A.oneHot: RFrame {_rframeKeys = ["zlh","ValueDouble (-2.0136650956085296)","ValueDouble 0.19461704868628563"], _rframeLookup = fromList [("ValueDouble 0.19461704868628563",2),("ValueDouble (-2.0136650956085296)",1),("zlh",0)], _rframeData = [[ValueText "rs",ValueDouble 0.19461704868628563,ValueDouble (-2.0136650956085296)],[ValueText "mx",ValueDouble (-2.0136650956085296),ValueDouble 0.19461704868628563]]}
There seems to be an extra "zlh" column.
from dh-core.
Bump
from dh-core.
Hi @Magalame, I don't have time to look at this right now. Could you see which test is broken and perhaps submit a patch for it?
from dh-core.
Sounds good!
from dh-core.
So, looking at the source code, the problem is this line: update hot cold in https://github.com/DataHaskell/dh-core/blob/master/analyze/src/Analyze/Ops.hs
It concatenates the one-hot encoding with the other columns. Is this the actual purpose of oneHot, or should we output the one-hot encoding only?
from dh-core.
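To make the two candidate behaviours concrete, here is a small sketch on plain lists. It does not use the analyze API: the Row type and both function names are hypothetical, and "1"/"0" stand in for whatever true/false values the real oneHot takes.

```haskell
import Data.List (nub)

-- Hypothetical row representation: (column key, value) pairs.
type Row = [(String, String)]

-- One indicator column per distinct value found under `key`.
hotColumns :: String -> [Row] -> [Row]
hotColumns key rows =
  [ [ (v, if lookup key row == Just v then "1" else "0") | v <- values ]
  | row <- rows ]
  where
    values = nub [ v | row <- rows, Just v <- [lookup key row] ]

-- Variant that also keeps the remaining ("cold") columns in front of the
-- indicator columns, which is the shape reported in the thread.
hotAndCold :: String -> [Row] -> [Row]
hotAndCold key rows = zipWith (++) cold (hotColumns key rows)
  where
    cold = map (filter ((/= key) . fst)) rows

main :: IO ()
main = do
  let rows = [[("c", "x"), ("d", "p")], [("c", "y"), ("d", "q")]]
  print (hotColumns "c" rows)
  print (hotAndCold "c" rows)
```

A test that expects the first shape will see an extra column when the implementation produces the second one, so the test and the function have to agree on which of the two is intended.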
Bump
from dh-core.
Addressed in #53. Thank you @Magalame!
from dh-core.
Thank you for your help!
from dh-core.
NB:
The "speed issue" isn't an issue. Two things are at play: the inherent slowness of the Gen generators, and laziness. Laziness meant that some tests only partially evaluated the frames. So the slow case is the "normal" case, and the faster ones are faster only because they don't fully evaluate the frames. This is easy to check by playing with Mersenne generators and deepseq.
from dh-core.
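The laziness point can be illustrated without any timing at all. In the sketch below (a toy list-of-lists frame, not the RFrame type), one cell is a deliberately failing computation: a shape-only check never touches it, while forcing the whole structure with deepseq does.

```haskell
import Control.DeepSeq (force)
import Control.Exception (SomeException, evaluate, try)

-- A tiny "frame" whose last cell is a deliberately failing computation.
frame :: [[Double]]
frame = [[1, 2], [3, error "cell was evaluated"]]

-- Shape-only check, similar in spirit to comparing row and column counts.
shapeOk :: [[Double]] -> Bool
shapeOk rows = length rows == 2 && all ((== 2) . length) rows

main :: IO ()
main = do
  -- Only the list spines are walked, so the failing cell is never touched.
  print (shapeOk frame)
  -- Forcing the whole structure reaches the cell and raises the error.
  r <- try (evaluate (force frame)) :: IO (Either SomeException [[Double]])
  putStrLn (either (const "forcing evaluated every cell")
                   (const "no cell failed") r)
```

A test that only inspects keys and dimensions is in the first situation, so it pays nothing for generating the cell contents; a test that compares whole frames is in the second.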
I came up with this so far. I just took two tests from the property test suite. benchEmpty needs to fully evaluate the data, while benchFixture does not.
module Main where

import qualified Analyze as A
import Analyze.RFrame (RFrameUpdate (..))
import qualified Data.Text as T
import Data.Text (Text)
import qualified Data.Vector as V
import Data.Vector (Vector)
import System.Random
import Control.Monad
import Control.DeepSeq
import qualified System.Random.MWC as M
import qualified Criterion.Main as C

n :: Int
n = 1000

testKeys :: IO (Vector Text)
testKeys = V.replicateM n $ liftM (T.pack . take 10 . randomRs ('a','z')) newStdGen

testData :: IO (Vector (Vector Double))
testData = V.replicateM n $ liftM (V.fromList . take n . randomRs (-1,1)) newStdGen

testDataMersenne :: IO (Vector (Vector Double))
testDataMersenne = do
  gen <- M.create
  V.replicateM n $ M.uniformVector gen n

-- compares two whole frames, so every cell has to be evaluated
benchEmpty :: IO (Vector Text) -> IO (Vector (Vector Double)) -> IO Bool
benchEmpty keysgen datagen = do
  keysb <- keysgen
  datab <- datagen
  let update = RFrameUpdate keysb datab
  expected <- A.fromUpdate update
  let lengthEmpty = length $ A._rframeUpdateKeys update
      emptyUpdate = RFrameUpdate V.empty (V.replicate lengthEmpty V.empty)
  empty <- A.fromUpdate emptyUpdate
  actual <- A.update update empty
  return $ actual == expected

-- only compares keys and dimensions, so the cells are never evaluated
benchFixture :: IO (Vector Text) -> IO (Vector (Vector Double)) -> IO Bool
benchFixture keysgen datagen = do
  keysb <- keysgen
  datab <- datagen
  let update = RFrameUpdate keysb datab
  frame <- A.fromUpdate update
  let -- get keys from both the update and the frame
      keys = A._rframeKeys frame
      keysUp = A._rframeUpdateKeys update
      -- number of rows from both
      nbRows = A.numRows frame
      nbRowsUp = length $ A._rframeUpdateData update
      -- number of columns from both
      nbCols = A.numCols frame
      nbColsUp = length $ A._rframeUpdateKeys update
  -- checks everything is the same for both
  return $ (keys == keysUp) && (nbRows == nbRowsUp) && (nbCols == nbColsUp)

main :: IO ()
main = C.defaultMain
  [ C.bgroup "Tests"
      [ C.bench "empty" $ C.whnfIO (benchEmpty testKeys testData)
      , C.bench "fixture" $ C.whnfIO (benchFixture testKeys testData)
      , C.bench "forced fixture" $ C.whnfIO (benchFixture testKeys (fmap force testData))
      , C.bench "mersenne empty" $ C.whnfIO (benchEmpty testKeys testDataMersenne)
      ]
  ]
It gives:
benchmarking Tests/empty
time                 1.363 s    (1.060 s .. 1.800 s)
                     0.989 R²   (0.968 R² .. 1.000 R²)
mean                 1.227 s    (1.143 s .. 1.300 s)
std dev              103.1 ms   (53.77 ms .. 131.9 ms)
variance introduced by outliers: 22% (moderately inflated)

benchmarking Tests/fixture
time                 4.098 ms   (3.607 ms .. 4.527 ms)
                     0.899 R²   (0.844 R² .. 0.938 R²)
mean                 5.786 ms   (5.411 ms .. 6.160 ms)
std dev              1.137 ms   (958.6 μs .. 1.407 ms)
variance introduced by outliers: 87% (severely inflated)

benchmarking Tests/forced fixture
time                 1.182 s    (1.129 s .. 1.259 s)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.221 s    (1.201 s .. 1.233 s)
std dev              20.13 ms   (7.737 ms .. 27.70 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking Tests/mersenne empty
time                 86.12 ms   (83.85 ms .. 88.79 ms)
                     0.998 R²   (0.995 R² .. 1.000 R²)
mean                 93.49 ms   (90.28 ms .. 99.31 ms)
std dev              6.993 ms   (2.581 ms .. 9.141 ms)
variance introduced by outliers: 19% (moderately inflated)
from dh-core.
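One way to measure the generation cost on its own, as a sketch only: criterion's whnfIO evaluates the result of the action just to its outermost constructor, while nfIO evaluates it completely. The genRows helper and the size 200 are hypothetical and chosen for illustration; the sketch assumes the criterion and random packages.

```haskell
import Control.Monad (replicateM)
import qualified Criterion.Main as C
import System.Random (newStdGen, randomRs)

-- Hypothetical generator: n rows of n random doubles, built lazily.
genRows :: Int -> IO [[Double]]
genRows n = replicateM n (take n . randomRs (-1, 1) <$> newStdGen)

main :: IO ()
main = C.defaultMain
  [ C.bgroup "generation"
      [ -- evaluates the result to weak head normal form only
        C.bench "whnfIO" $ C.whnfIO (genRows 200)
        -- evaluates every cell of every row
      , C.bench "nfIO" $ C.nfIO (genRows 200)
      ]
  ]
```

If the gap between the two mirrors the gap between "fixture" and "forced fixture" above, that would support the reading that the time goes into producing the random cells rather than into the frame operations.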