Comments (17)

ocramz commented on May 25, 2024

@Magalame yep!

from dh-core.

Magalame commented on May 25, 2024

I think I'd like to try that one too.
If I understand correctly, the idea is to replace every unit test with a property test? In short: write the Arbitrary instances first, and then the properties themselves?
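
For illustration, here is a minimal sketch of that two-step workflow with QuickCheck. The Frame type and takeRows are hypothetical stand-ins invented for this example; they are not the Analyze API, and the real instances would be written for the library's own types.

```haskell
import Test.QuickCheck

-- Hypothetical stand-in for a data frame: column keys plus rows of cells.
-- This is NOT Analyze.RFrame, only an illustration.
data Frame = Frame
  { frameKeys :: [String]
  , frameRows :: [[Int]]
  } deriving (Eq, Show)

-- Step 1: an Arbitrary instance that only produces well-formed frames,
-- i.e. every row has exactly one cell per key.
instance Arbitrary Frame where
  arbitrary = do
    keys <- listOf (listOf1 (choose ('a', 'z')))
    rows <- listOf (vectorOf (length keys) arbitrary)
    pure (Frame keys rows)

-- Hypothetical operation under test: keep only the first n rows.
takeRows :: Int -> Frame -> Frame
takeRows n (Frame ks rs) = Frame ks (take n rs)

-- Step 2: the property. takeRows must leave the keys untouched and
-- must never return more than n rows.
prop_takeRows :: NonNegative Int -> Frame -> Bool
prop_takeRows (NonNegative n) f =
  let f' = takeRows n f
  in frameKeys f' == frameKeys f && length (frameRows f') <= n
```

Loading this in GHCi and running quickCheck prop_takeRows checks the property against 100 generated frames by default.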

Magalame commented on May 25, 2024

@ocramz So I started writing the tests (it's a bit messy, I'll organise it more cleanly later): https://github.com/Magalame/dh-core/blob/replace-fixtures/analyze/test/Spec2.hs

It's my first time really experimenting with QuickCheck and the like, so I was wondering if it looks good so far?

Magalame commented on May 25, 2024

Almost completely set up, only the oneHot part is missing. I think I might be doing something wrong, because some tests perform poorly, with a 15x slowdown.

ocramz commented on May 25, 2024

@Magalame um, that sounds strange. Which tests run so slowly?

Magalame commented on May 25, 2024

When I run them, I get this:

Test suite
  Fixture:        OK (0.10s)
    +++ OK, passed 100 tests.
  Row Decode:     OK (0.15s)
    +++ OK, passed 100 tests.
  Drop:           OK (0.11s)
    +++ OK, passed 100 tests.
  Keep:           OK (0.10s)
    +++ OK, passed 100 tests.
  Update Empty:   OK (1.83s)   # all of these, from here down
    +++ OK, passed 100 tests.
  Update Empty 2: OK (1.67s)
    +++ OK, passed 100 tests.
  Update Add:     OK (1.72s)
    +++ OK, passed 100 tests.
  Update Overlap: OK (1.79s)
    +++ OK, passed 100 tests.
  Take Rows:      OK (0.95s)
    +++ OK, passed 100 tests.
  Add Column:     OK (2.01s)
    +++ OK, passed 100 tests.

All 10 tests passed (10.43s)

I tried to profile it, here's the .prof: https://ufile.io/1lqia . However it looks a bit cryptic to me.

Magalame commented on May 25, 2024

I think I'm having trouble dealing with oneHot. Here is an example of what I obtain with stack runghc Spec2.hs:

-----------------------
Original Update: RFrameUpdate {_rframeUpdateKeys = ["","zlh"], _rframeUpdateData = [[ValueDouble (-2.0136650956085296),ValueText "rs"],[ValueDouble 0.19461704868628563,ValueText "mx"]]}
-----
Original data: [[ValueDouble (-2.0136650956085296),ValueText "rs"],[ValueDouble 0.19461704868628563,ValueText "mx"]]
-----
Key to test: ""
-----
True/false value: ValueDouble 0.19461704868628563/ValueDouble (-2.0136650956085296)
-----
Expected result: RFrame {_rframeKeys = ["ValueDouble (-2.0136650956085296)","ValueDouble 0.19461704868628563"], _rframeLookup = fromList [("ValueDouble 0.19461704868628563",1),("ValueDouble (-2.0136650956085296)",0)], _rframeData = [[ValueDouble 0.19461704868628563,ValueDouble (-2.0136650956085296)],[ValueDouble (-2.0136650956085296),ValueDouble 0.19461704868628563]]}
-----
From A.oneHot: RFrame {_rframeKeys = ["zlh","ValueDouble (-2.0136650956085296)","ValueDouble 0.19461704868628563"], _rframeLookup = fromList [("ValueDouble 0.19461704868628563",2),("ValueDouble (-2.0136650956085296)",1),("zlh",0)], _rframeData = [[ValueText "rs",ValueDouble 0.19461704868628563,ValueDouble (-2.0136650956085296)],[ValueText "mx",ValueDouble (-2.0136650956085296),ValueDouble 0.19461704868628563]]}

There seems to be an extra "zlh" column.

Magalame commented on May 25, 2024

Bump

ocramz commented on May 25, 2024

Hi @Magalame, I don't have time to look at this right now. Could you see which test is broken and perhaps submit a patch for it?

Magalame commented on May 25, 2024

Sounds good!

Magalame commented on May 25, 2024

So, looking at the source code, the problem is this line: update hot cold, in https://github.com/DataHaskell/dh-core/blob/master/analyze/src/Analyze/Ops.hs
As a result, the one-hot encoding and the other columns are concatenated together. Is that the intended behaviour of oneHot, or should it output only the one-hot encoding?
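
To make the two candidate behaviours concrete, here is a small sketch on a toy table. The Table type and both functions are hypothetical and written only for this comparison; they are not the code in Analyze/Ops.hs. The sketch assumes rectangular input, i.e. every row has one cell per column.

```haskell
import Data.List (elemIndex, nub)

-- Hypothetical toy table: column names plus rows of string cells.
-- This is not the Analyze RFrame API.
type Table = ([String], [[String]])

-- Behaviour A: return only the indicator columns, one per distinct
-- value found in the chosen column.
oneHotOnly :: String -> Table -> Maybe Table
oneHotOnly key (cols, rows) = do
  i <- elemIndex key cols
  let vals = nub (map (!! i) rows)
      encode row = [ if row !! i == v then "1" else "0" | v <- vals ]
  pure (vals, map encode rows)

-- Behaviour B: same encoding, but the remaining columns are kept and
-- the indicator columns are appended after them.
oneHotKeepRest :: String -> Table -> Maybe Table
oneHotKeepRest key t@(cols, rows) = do
  i <- elemIndex key cols
  (hotCols, hotRows) <- oneHotOnly key t
  let dropAt xs = take i xs ++ drop (i + 1) xs
  pure (dropAt cols ++ hotCols, zipWith (\r h -> dropAt r ++ h) rows hotRows)
```

With the table (["colour","size"], [["red","s"],["blue","m"]]) and key "colour", behaviour A yields the columns ["red","blue"] only, while behaviour B yields ["size","red","blue"]. The extra "zlh" column in the output above matches behaviour B.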

Magalame commented on May 25, 2024

Bump

ocramz commented on May 25, 2024

Addressed in #53. Thank you @Magalame!

Magalame commented on May 25, 2024

Thank you for your help!

Magalame commented on May 25, 2024

NB:
The "speed issue" isn't really an issue. Two things are at play: the inherent slowness of the generators, and laziness. Because of laziness, some tests only partially evaluate the frames. So the slow case is the "normal" case, and the faster tests are fast only because they never fully evaluate the frames. This is easy to check by playing with Mersenne generators and deepseq.
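
A minimal sketch of the laziness effect, using plain nested lists instead of the real frames (all names here are made up for the example). The cells are "poisoned" with error, so any code that actually evaluates a cell is detected immediately.

```haskell
import Control.DeepSeq (force)
import Control.Exception (ErrorCall, evaluate, try)

-- A frame-like value whose cells throw if anything evaluates them.
poisoned :: [[Double]]
poisoned = replicate 3 (replicate 3 (error "cell evaluated"))

-- Shape-only check, analogous to a test that compares keys and
-- dimensions: it walks the list spines but never touches a cell.
shapeOnly :: [[Double]] -> (Int, Int)
shapeOnly rows = (length rows, maximum (0 : map length rows))

-- Deep evaluation with deepseq: True if every cell could be evaluated,
-- False if forcing hit a poisoned cell.
survivesForce :: [[Double]] -> IO Bool
survivesForce rows = do
  r <- try (evaluate (force rows)) :: IO (Either ErrorCall [[Double]])
  pure (either (const False) (const True) r)
```

shapeOnly poisoned returns (3,3) without error, because no cell is ever computed, whereas survivesForce poisoned returns False. A test of the first kind skips the cost of generating the data, which is why it looks much faster.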

ocramz commented on May 25, 2024

Magalame commented on May 25, 2024

I came up with this so far. I just took two tests from the property test suite. benchEmpty needs to fully evaluate the data, while benchFixture does not.

module Main where

import qualified Analyze                  as A
import Analyze.RFrame (RFrameUpdate (..))

import qualified Data.Text           as T
import           Data.Text           (Text)

import qualified Data.Vector         as V
import           Data.Vector         (Vector)

import System.Random
import Control.Monad
import Control.DeepSeq

import qualified System.Random.MWC as M

import qualified Criterion.Main as C

n :: Int
n = 1000


testKeys ::  IO (Vector Text)
testKeys =  V.replicateM n $ liftM (T.pack . take 10 . randomRs ('a','z')) newStdGen

testData ::  IO (Vector (Vector Double))
testData =  V.replicateM n $ liftM (V.fromList . take n . randomRs (-1,1)) newStdGen

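-- Note: despite the name, System.Random.MWC is a multiply-with-carry
-- generator, not a Mersenne Twister. M.create uses a fixed seed.
testDataMersenne :: IO (Vector (Vector Double))
testDataMersenne = do
    gen <- M.create
    V.replicateM n $ M.uniformVector gen n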


benchEmpty :: IO (Vector Text) -> IO (Vector (Vector Double)) -> IO Bool
benchEmpty keysgen datagen = do 

    keysb <- keysgen
    datab <- datagen

    let 
      update = RFrameUpdate keysb datab


    expected <- A.fromUpdate update
    
    let
      lengthEmpty = length $ A._rframeUpdateKeys update
      emptyUpdate = RFrameUpdate V.empty (V.replicate lengthEmpty V.empty)
    
    empty <- A.fromUpdate emptyUpdate
    actual <- A.update update empty
            
    return $ actual == expected


benchFixture :: IO (Vector Text) -> IO (Vector (Vector Double)) -> IO Bool
benchFixture keysgen datagen = do 

    keysb <- keysgen
    datab <- datagen

    let
      update = RFrameUpdate keysb datab

    frame <- A.fromUpdate update

    let
      -- keys from both the frame and the update
      keys = A._rframeKeys frame
      keysUp = A._rframeUpdateKeys update

      -- number of rows in both
      nbRows = A.numRows frame
      nbRowsUp = length $ A._rframeUpdateData update

      -- number of columns in both
      nbCols = A.numCols frame
      nbColsUp = length $ A._rframeUpdateKeys update

    -- only keys and dimensions are compared, so the cell values are never forced
    return $ (keys == keysUp) && (nbRows == nbRowsUp) && (nbCols == nbColsUp)


main :: IO ()
main = C.defaultMain
  [ C.bgroup "Tests"
      [ C.bench "empty"          $ C.whnfIO (benchEmpty testKeys testData)
      , C.bench "fixture"        $ C.whnfIO (benchFixture testKeys testData)
      , C.bench "forced fixture" $ C.whnfIO (benchFixture testKeys (fmap force testData))
      , C.bench "mersenne empty" $ C.whnfIO (benchEmpty testKeys testDataMersenne)
      ]
  ]

It gives:

benchmarking Tests/empty
time                 1.363 s    (1.060 s .. 1.800 s)
                     0.989 R²   (0.968 R² .. 1.000 R²)
mean                 1.227 s    (1.143 s .. 1.300 s)
std dev              103.1 ms   (53.77 ms .. 131.9 ms)
variance introduced by outliers: 22% (moderately inflated)

benchmarking Tests/fixture
time                 4.098 ms   (3.607 ms .. 4.527 ms)
                     0.899 R²   (0.844 R² .. 0.938 R²)
mean                 5.786 ms   (5.411 ms .. 6.160 ms)
std dev              1.137 ms   (958.6 μs .. 1.407 ms)
variance introduced by outliers: 87% (severely inflated)

benchmarking Tests/forced fixture
time                 1.182 s    (1.129 s .. 1.259 s)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.221 s    (1.201 s .. 1.233 s)
std dev              20.13 ms   (7.737 ms .. 27.70 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking Tests/mersenne empty
time                 86.12 ms   (83.85 ms .. 88.79 ms)
                     0.998 R²   (0.995 R² .. 1.000 R²)
mean                 93.49 ms   (90.28 ms .. 99.31 ms)
std dev              6.993 ms   (2.581 ms .. 9.141 ms)
variance introduced by outliers: 19% (moderately inflated)
