Comments (4)
The only dependency on `Execution` (or anything Scalding-specific in general) is in the `com.stripe.brushfire.scalding` package, which is to say, really, the `Trainer` class. Although I realize there is probably some small amount of `Trainer` that could be generalized and reused for a Spark implementation, I think the first step is just to build a completely new `Trainer` for Spark, which uses Spark idioms, and then see how similar they actually are.
cc @non who was also looking into this...
from brushfire.
(But if there is anything specific I can explain about how the Scalding version works I'd be happy to do so)
@avibryant I was referring to the `scalding` package. The rest is very general. Also, Scalding and Spark are very interchangeable, but unfortunately there is no Spark equivalent for Scalding's new `Execution` feature (which is awesome, by the way).
I would like to try to build a new `Trainer` class for Spark, but unfortunately I am not sure I follow how the Scalding version is evaluated: what happens in what order, and what happens in parallel. But maybe if we can go over the code, I can translate it, and then we can even have nice benchmarks to compare.
@vitalyg I'd be happy to go over it with you, maybe over IRC or something sometime next week?
The simplest thing to start with is `updateTargets`, which is used for constructing the root node of an empty tree, and can also be used to update the leaf distributions for an existing tree from new training data.
The idea here is that you pass over the training data once:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L76
For each tree we're building, we find out how many times to include this instance in that tree:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L79
Then, that many times, we find the leaf corresponding to that instance in that tree, and we emit a key -> value pair which is (treeIndex, leafIndex) -> instance target:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L79
Then, we can, in parallel, sum up all of those values.
Then we group just by key to bring together all of the summed targets for a tree, by leafIndex:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L84
Then we (in parallel, but only with as much parallelism as we have trees) modify the trees to have the new targets:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L92
At the end we write out the new trees.
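To make the order of operations concrete, here is a rough sketch of the same flow using plain Scala collections instead of Scalding pipes. Everything in it (`Instance`, `Tree`, `leafIndexFor`, `timesInTree`, `sumTargets`) is a hypothetical stand-in made up for illustration, not brushfire's actual types or API, and targets are assumed to be simple label-count maps. The comments mark which steps would run in parallel on a cluster.

```scala
// Hypothetical stand-ins for illustration only; not brushfire's real types.
final case class Instance(features: Map[String, Double], target: Map[String, Long])

// A "tree" here is just a function from features to a leaf index, plus per-leaf targets.
final case class Tree(
  leafIndexFor: Map[String, Double] => Int,
  targets: Map[Int, Map[String, Long]])

object UpdateTargetsSketch {
  // Add two label-count maps together, label by label.
  def sumTargets(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (label, n)) =>
      acc.updated(label, acc.getOrElse(label, 0L) + n)
    }

  def updateTargets(
    trees: Vector[Tree],
    data: Seq[Instance],
    timesInTree: (Instance, Int) => Int // assumed sampler: copies of an instance per tree
  ): Vector[Tree] = {
    // One pass over the data: emit ((treeIndex, leafIndex), target) once per inclusion.
    val emitted: Seq[((Int, Int), Map[String, Long])] =
      for {
        instance <- data
        (tree, treeIndex) <- trees.zipWithIndex
        _ <- 1 to timesInTree(instance, treeIndex)
      } yield ((treeIndex, tree.leafIndexFor(instance.features)), instance.target)

    // Sum the targets per (treeIndex, leafIndex) key; this is the parallel sum.
    val summed: Map[(Int, Int), Map[String, Long]] =
      emitted.groupBy(_._1).map { case (key, pairs) =>
        key -> pairs.map(_._2).reduce((a, b) => sumTargets(a, b))
      }

    // Regroup by treeIndex alone, collecting leafIndex -> summed target for each tree.
    val byTree: Map[Int, Map[Int, Map[String, Long]]] =
      summed.groupBy { case ((treeIndex, _), _) => treeIndex }.map {
        case (treeIndex, entries) =>
          treeIndex -> entries.map { case ((_, leafIndex), target) => leafIndex -> target }
      }

    // Give each tree its new leaf targets; parallelism is bounded by the number of trees.
    trees.zipWithIndex.map { case (tree, treeIndex) =>
      tree.copy(targets = byTree.getOrElse(treeIndex, Map.empty[Int, Map[String, Long]]))
    }
  }
}
```

In a Spark version, the emit step would presumably become a `flatMap` over the training data and the per-key sum a `reduceByKey`, but that mapping is a guess until someone actually writes it.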