pierrenodet / spark-ensemble
Ensemble Learning for Apache Spark 🌲
Home Page: https://pierrenodet.github.io/spark-ensemble/
License: Apache License 2.0
At the moment, a UDF is used to slice the input features vector, but the column metadata is not updated, so it no longer matches the sliced vector.
VectorSlicer can be used during training to handle the metadata correctly; for prediction, the UDF seems sufficient.
Iterator.fill inside a flatMap seems fine, but creating a PoissonDistribution for every row seems expensive.
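One way to avoid building a distribution for every row is to create the sampler once per partition and reuse it inside the flatMap. The following is a minimal sketch in plain Scala, not the library's code: `PoissonBagging`, `poisson`, and `resamplePartition` are hypothetical names, a Knuth-style sampler stands in for a `PoissonDistribution` object, and a plain `Iterator` stands in for the rows of one partition.

```scala
import scala.util.Random

object PoissonBagging {

  // Draw one Poisson(lambda) variate with Knuth's multiplication method.
  // Assumption: lambda is small (bagging usually uses a rate close to 1).
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Resample one partition: the generator is created once here, then each
  // row is repeated Poisson(lambda) times (possibly zero times).
  def resamplePartition[T](rows: Iterator[T], lambda: Double, seed: Long): Iterator[T] = {
    val rng = new Random(seed)
    rows.flatMap(row => Iterator.fill(poisson(lambda, rng))(row))
  }
}
```

With Spark, a function of this shape could presumably be passed to `mapPartitionsWithIndex`, deriving the seed from the partition index so that each partition draws from its own stream.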
At the moment, Bagging and Boosting (but not Stacking and GBM) use RDDs to batch data or to compute the current prediction error, and a new dataset is created to be passed to the weak learner.
The problem is that the metadata is not passed along and is lost, which is an issue for datasets that rely on it.
There is no check for the case where maxError is 0 at a given iteration; it is later used as a divisor, which produces null values.
The algorithm should either stop or handle the division by zero.
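A guard of this kind could look like the sketch below. It assumes the per-instance loss is the absolute error divided by the largest absolute error of the iteration; `BoostingGuard` and `relativeLosses` are hypothetical names, not the library's API.

```scala
object BoostingGuard {

  // Returns the losses scaled into [0, 1], or None when every error is zero
  // (or there are no errors), so the caller can stop boosting instead of
  // dividing by zero.
  def relativeLosses(absErrors: Seq[Double]): Option[Seq[Double]] = {
    val maxError = if (absErrors.isEmpty) 0.0 else absErrors.max
    if (maxError == 0.0) None
    else Some(absErrors.map(_ / maxError))
  }
}
```

Returning an `Option` makes the zero case explicit at the call site: the boosting loop can match on `None` and terminate early, keeping the learners trained so far.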
Make sure we have the same level of comments and documentation as MLlib requires.
Refactoring the hierarchy of params would be nice.
Refactoring the code shared between the classifier and regressor of the same kind, and between ensemble methods of the same kind, into traits/objects would be neat!
At the moment the tests are only end-to-end tests; unit tests are needed!
Are these two Spark SQL functions combined really efficient for the sampling method?
Maybe we can just do an Iterator.fill in a flatMap?
At the moment the sampling phase is split in two so that we can keep track of which instance is in which bag; using df.sample when the out-of-bag (OOB) error is never computed would be much better!
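The two-phase idea can be illustrated with a small plain-Scala sketch. It is an assumption-laden simplification, not the library's implementation: it samples without replacement (each row joins each bag independently with probability `fraction`), and `BagMembership`, `membership`, and `outOfBag` are hypothetical names.

```scala
import scala.util.Random

object BagMembership {

  // Phase 1: for one row, decide independently for each bag whether the
  // row belongs to it.
  def membership(numBags: Int, fraction: Double, rng: Random): Vector[Boolean] =
    Vector.fill(numBags)(rng.nextDouble() < fraction)

  // Phase 2: indices of the bags for which this row is out-of-bag, i.e. the
  // models that can be evaluated on it for an OOB estimate.
  def outOfBag(member: Vector[Boolean]): Vector[Int] =
    member.zipWithIndex.collect { case (false, i) => i }
}
```

Keeping the membership vector is what makes OOB evaluation possible; when OOB is never needed, this bookkeeping can be dropped and a single sampling call per bag is enough.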
At the moment it's not clear how to pass params to weak learners when doing cross-validation.
You have to provide an array of classifiers with different params, which is not convenient.