
Comments (4)

ClementMayer commented on June 25, 2024

Work is in progress in the contributivity working group (see the dedicated repository). A library is being set up, already offering 8 to 9 different calculation methods, but no findings have been shared at this stage.
The working group on contributivity will start thinking about a potential integration into the framework and about the right layer at which to integrate the work done so far. In particular, one possibility to explore is the generation of compute plans.

from substra-documentation.

bowni commented on June 25, 2024

Ok, so here we go for some thoughts on that one:

  1. What we call "contributivity" is a measurement of how much the different partners and their respective datasets have contributed to the performance of a model trained in a distributed fashion on these datasets
  2. The baseline measurement approach is the computation of the Shapley value of each dataset, where the characteristic function attributes to a given set of partners the performance of the model trained on those partners' datasets. It has all the desirable theoretical properties; however, it is not computationally viable beyond 3 or 4 partners, since the number of subsets to evaluate grows exponentially with the number of partners. As such, most approaches are in fact approximations of Shapley. They all rely on being able to compute the performance of the model trained on subsets of the partners and their datasets (see the dedicated MPLC project for more information)
    • Note 1: the Shapley approximations studied so far are all based on sampling, and on optimizing the sampling of certain subsets. They are relevant in use cases with large numbers of partners (e.g. 100+). In our typical cases, with a few to a few tens of partners, they aren't really applicable
    • Note 2: on the other hand, exact Shapley is fine up to 3 or 4 partners. So we have a gap between roughly 4 and 100 partners
  3. Other types of approaches, not based on Shapley, are being contemplated and tested, including approaches based on:
    • model ensembling (also Shapley-based, but with a different characteristic function which doesn't require additional intensive model training, but many model aggregations instead)
    • measuring performance increments after each local minibatch
    • reinforcement learning (DVRL / PVRL)
  4. If we now explore what it would mean to integrate some of this with Substra:
    1. For Shapley:
      • We need to be able to train models on subsets of the available partners, test them to get performance figures, and in the end do some computations on all those performance figures. This could be accomplished "manually": registering dedicated compute plans for each model training and testing, fetching the performance figures, and computing the Shapley values ourselves at the end
      • To go one step further, we could design one large compute plan encompassing all the model trainings and tests that are needed. Since all those tasks are known from the start, this should be possible
      • Ultimately, to really automate it, the final task of the large compute plan could be a traintuple that executes the computation of the Shapley values, using as inputs the performance figures from the compute plan's past testtuples. This would not require training on a dataset, so it would be an "edge use" of a traintuple, just executing a script.
    2. For the federated step-by-step increments:
      • In this approach, after each local training on a mini-batch, we run a performance test and save the performance figure. After the whole training is completed, we take all those performance figures and compile them
      • So the high-level analysis of how to run this on a multi-partner Substra setup is similar to Shapley above: we would need to carefully design a large compute plan encompassing all train and test tasks, with a final task launching the script that compiles all the performance figures into synthetic contributivity values
    3. For model ensembling:
      • Here again it should be similar. The only specific aspect is that we need to be able to aggregate individually trained models
    4. For DVRL / PVRL: requires more work; these approaches are complex and I don't yet properly understand how they work
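To make the baseline concrete, here is a minimal sketch of the exact Shapley computation described in point 2. The `score` characteristic function below is purely illustrative (a toy stand-in for "train a model on this subset's datasets and measure its performance"); in practice each call to it would be a full training-plus-testing run, which is exactly why the exponential number of subsets becomes prohibitive beyond 3 or 4 partners.

```python
from itertools import combinations
from math import factorial

def shapley_values(partners, score):
    """Exact Shapley value of each partner.

    score(frozenset_of_partners) is the characteristic function: the
    performance of a model trained on that subset's datasets.
    Cost is O(2^n) score evaluations, hence impractical for many partners.
    """
    n = len(partners)
    values = {}
    for p in partners:
        others = [q for q in partners if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Standard Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of p to this coalition
                total += weight * (score(s | {p}) - score(s))
        values[p] = total
    return values

# Toy characteristic function: performance grows with total data volume,
# with diminishing returns (purely illustrative, no real training).
data_sizes = {"A": 100, "B": 100, "C": 50}
def score(subset):
    return sum(data_sizes[p] for p in subset) ** 0.5

print(shapley_values(list(data_sizes), score))
```

By construction the values sum to `score(all partners) - score(empty set)` (the efficiency property), and the two partners with identical datasets here get identical values (symmetry); the approximation methods mentioned above trade away exactness to avoid evaluating all 2^n subsets.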

from substra-documentation.

ClementMayer commented on June 25, 2024

Update from last MAP committee (10/09/2020):

  • Work is in progress in the contributivity working group. The package already offers 8 to 9 different calculation methods. No large benchmark has been run and no findings have been shared yet
  • It seems possible to design a large compute plan with all the train and test tasks needed to run certain contributivity measurement methods
  • Interesting next steps:
    • quick review by Camille to see whether Eric's first thoughts are coherent
    • Include in Arthur’s internship roadmap to work on a “PoC” of a simple contributivity measurement approach on Substra
    • Identify another dataset in order to have other data / thoughts
  • Want to join the discussion? Join the public Slack channel #workgroup-mpl-contributivity and participate in the discussions and the dedicated repos.
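On the "large compute plan" point above: since every subset training and its test are known in advance, the full task graph can be enumerated up front. The sketch below only lays out illustrative task descriptions as plain dicts — this is not the Substra SDK schema, just a shape for the dependency structure (train task per subset, test task depending on it, and one final task that consumes all test metrics).

```python
from itertools import combinations

partners = ["org1", "org2", "org3"]  # hypothetical partner identifiers

# Enumerate every non-empty subset of partners; each subset yields one
# training task and one testing task depending on it. Task dicts are
# purely illustrative, not the actual Substra API.
tasks = []
for k in range(1, len(partners) + 1):
    for subset in combinations(partners, k):
        key = "+".join(subset)
        tasks.append({"kind": "train", "id": f"train_{key}", "datasets": list(subset)})
        tasks.append({"kind": "test", "id": f"test_{key}", "depends_on": f"train_{key}"})

# Final task: consumes all test metrics and computes contributivity scores
# (the "edge use" traintuple discussed earlier, just executing a script).
tasks.append({
    "kind": "aggregate",
    "id": "compute_contributivity",
    "depends_on": [t["id"] for t in tasks if t["kind"] == "test"],
})

print(len(tasks))  # 2 * (2**3 - 1) + 1 = 15 tasks
```

With 3 partners this is already 15 tasks; the count doubles with each additional partner, which matches the scaling concern raised in the Shapley discussion.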

from substra-documentation.

RomainGoussault commented on June 25, 2024

Closing stale issue.

from substra-documentation.
