
Comments (4)

avibryant commented on September 18, 2024

The only dependency on Execution (or anything Scalding specific in general) is in the com.stripe.brushfire.scalding package, which is to say, really, the Trainer class. Although I realize there is probably some small amount of Trainer that could be generalized and reused for a Spark implementation, I think the first step is just to build a completely new Trainer for Spark, which uses Spark idioms, and then see how similar they actually are.

cc @non who was also looking into this...

from brushfire.

avibryant commented on September 18, 2024

(But if there is anything specific I can explain about how the Scalding version works, I'd be happy to do so.)


vitalyg commented on September 18, 2024

@avibryant I was referring to the scalding package. The rest is very general. Also, Scalding and Spark are largely interchangeable, but unfortunately there is no Spark equivalent for Scalding's new Execution feature (which is awesome, by the way).

I would like to try to build a new Trainer class for Spark, but unfortunately I am not sure I follow how the Scalding version is evaluated: what happens after what, and what happens in parallel. But maybe if we can go over the code, I can translate it, and then we can even have nice benchmarks to compare.


avibryant commented on September 18, 2024

@vitalyg I'd be happy to go over it with you, maybe over IRC or something sometime next week?

The simplest thing to start with is updateTargets, which is used for constructing the root node of an empty tree, and can also be used to update the leaf distributions for an existing tree from new training data.

The idea here is that you pass over the training data once:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L76

For each tree we're building, we find out how many times to include this instance in that tree:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L79

Then, that many times, we find the leaf corresponding to that instance in that tree, and we emit a key -> value pair which is (treeIndex, leafIndex) -> instance target:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L79

Then, we can, in parallel, sum up all of those values.

Then we group just by treeIndex to bring together all of the summed targets for a tree, keyed by leafIndex:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L84

Then we (in parallel, but only with as much parallelism as we have trees) modify the trees to have the new targets:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L92

At the end we write out the new trees.
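
The steps above can be sketched with plain Scala collections, with no Scalding or Spark involved. This is only an illustration of the data flow, not the actual Trainer code: `Instance`, `Tree`, `leafFor`, `timesInTree` and `sumTargets` are hypothetical stand-ins, targets are assumed to be label counts, and the sketch assumes a tree's leaf targets are simply replaced by the new sums.

```scala
// Hypothetical stand-ins; these are NOT brushfire's actual types.
final case class Instance(features: Map[String, Double], target: Map[String, Long])

// A "tree" here is just a function from features to a leaf index,
// plus the summed target distribution stored at each leaf.
final case class Tree(
    leafFor: Map[String, Double] => Int,
    leafTargets: Map[Int, Map[String, Long]]
)

object UpdateTargetsSketch {
  // Combine two target distributions by adding the counts per label.
  def sumTargets(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (label, n)) =>
      acc.updated(label, acc.getOrElse(label, 0L) + n)
    }

  def updateTargets(
      trees: Vector[Tree],
      data: Seq[Instance],
      timesInTree: (Instance, Int) => Int // assumed sampling rule, per instance and tree
  ): Vector[Tree] = {
    // One pass over the training data: for each tree, emit
    // (treeIndex, leafIndex) -> target once per inclusion of the instance.
    val emitted: Seq[((Int, Int), Map[String, Long])] =
      for {
        instance <- data
        (tree, treeIndex) <- trees.zipWithIndex
        _ <- 1 to timesInTree(instance, treeIndex)
      } yield ((treeIndex, tree.leafFor(instance.features)), instance.target)

    // Sum the targets per (treeIndex, leafIndex) key.
    // In a distributed setting this is the step that runs in parallel.
    val summed: Map[(Int, Int), Map[String, Long]] =
      emitted.groupBy(_._1).map { case (key, pairs) =>
        key -> pairs.map(_._2).reduce(sumTargets)
      }

    // Regroup by treeIndex alone, collecting each tree's sums by leafIndex.
    val byTree: Map[Int, Map[Int, Map[String, Long]]] =
      summed.toSeq
        .groupBy { case ((treeIndex, _), _) => treeIndex }
        .map { case (treeIndex, entries) =>
          treeIndex -> entries.map { case ((_, leafIndex), target) => leafIndex -> target }.toMap
        }

    // Give each tree its new leaf targets (parallelism here is bounded by the number of trees).
    trees.zipWithIndex.map { case (tree, treeIndex) =>
      tree.copy(leafTargets = byTree.getOrElse(treeIndex, Map.empty))
    }
  }
}
```

On Spark, the same shape would presumably become a `flatMap` over the training data, a `reduceByKey` for the per-leaf sums, and a regrouping keyed by tree index, but that mapping is a guess until an actual Spark Trainer exists.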
